Methods and compositions for imputing or predicting genotype or phenotype

ABSTRACT

Methods and compositions to impute or predict genotype, haplotype, molecular phenotype, agronomic phenotypes, and/or coancestry are provided. Methods and compositions provided include using latent space to generate latent space representations or latent vectors that are independent of underlying genotypic or phenotypic data. The methods may include generating a universal latent space representation by encoding discrete or continuous variables derived from genotypic or phenotypic data into latent vectors through a machine learning-based encoder framework. Provided herein are universal methods of parametrically representing genotypic or phenotypic data obtained from one or more populations or sample sets to impute or predict a genotype or phenotype of interest.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and priority to U.S. Provisional Application No. 62/960,363, filed Jan. 13, 2020, U.S. Provisional Application No. 62/833,497, filed Apr. 12, 2019, and U.S. Provisional Application No. 62/816,719, filed Mar. 11, 2019, each of which is incorporated herein by reference in its entirety.

FIELD

This disclosure relates generally to the fields of imputation and prediction.

BACKGROUND OF THE INVENTION

Over the last 60 to 70 years, the contribution of plant breeding to agricultural productivity has been spectacular (Smith (1998) 53rd Annual corn and sorghum research conference, American Seed Trade Association, Washington, D.C.; Duvick (1992) Maydica 37: 69). This has happened in large part because plant breeders have been adept at assimilating and integrating information from extensive evaluations of segregating progeny derived from multiple crosses of elite, inbred lines. Conducting such breeding programs requires extensive resources. A commercial maize breeder, for example, may evaluate 1,000 to 10,000 F3 topcrossed progeny derived from 100 to 200 crosses in replicated field trials across wide geographic regions.

SUMMARY

In one embodiment, a universal method of parametrically representing genotypic or phenotypic association data from a training data set obtained from a population or a sample set to impute or predict a genotype and/or a phenotype in a test data obtained from a test population or a test sample data is provided herein. In some aspects, the method includes generating a universal continuous global latent space representation by encoding discrete or continuous variables derived from genome-wide genotypic or phenome-wide phenotypic association training data into latent vectors through a machine learning-based global encoder framework. In some examples, the encoder is an autoencoder. In some examples, the autoencoder is a variational autoencoder. In some aspects, the machine-learning based encoder framework is a generative adversarial network (GAN). In some aspects, the machine-learning based encoder framework is a neural network.

In some aspects, the global latent space or global latent space representation is independent of the underlying genotypic or phenotypic association used to represent the genetic or phenotypic information. For example, the generated latent representations are invariant to the selection of particular genotypic or phenotypic association features. In some aspects, the method includes generating a local latent representation by encoding a subset of the discrete or continuous variables derived from the genotypic or phenotypic association training data set into latent vectors through a machine learning-based local encoder framework, where the local latent space or local latent space representation is generated with inputs from the local encoder and the global encoder. In some examples, the local encoder is an autoencoder. In some examples, the autoencoder is a variational autoencoder. In some aspects, the machine-learning based encoder framework is a generative adversarial network (GAN). In some aspects, the machine-learning based encoder framework is a neural network.

In some aspects, the method includes decoding the global latent representation and the local latent representation by a local decoder, thereby imputing or predicting the genotype or phenotype of the test data by the combination of the decoded global latent representation and the local latent representation.
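
By way of non-limiting illustration only, the decoding of a combined global and local latent representation may be sketched in Python as follows. The sketch assumes that the two latent vectors are combined by concatenation and that the local decoder is a single linear layer with a sigmoid output per SNP. The function names (`combine_latents`, `decode_local`), the latent dimensions, and all weights are hypothetical and are not taken from this disclosure.

```python
import math


def combine_latents(global_z, local_z):
    """Concatenate a global and a local latent vector (hypothetical helper)."""
    return list(global_z) + list(local_z)


def sigmoid(value):
    """Numerically stable logistic function mapping a real number into (0, 1)."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)


def decode_local(combined_z, weights, biases):
    """Toy one-layer local decoder (an assumption, not the disclosed decoder).

    Each row of `weights` corresponds to one SNP in the local segment. The
    output is a probability that the SNP carries the alternate allele.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        activation = sum(w * z for w, z in zip(row, combined_z)) + bias
        outputs.append(sigmoid(activation))
    return outputs


# Illustrative values only: a 3-dimensional global vector and a 2-dimensional
# local vector are decoded into probabilities for 2 local SNPs.
global_z = [0.5, -1.0, 0.25]
local_z = [1.5, -0.5]
weights = [
    [0.2, 0.0, -0.4, 1.0, 0.3],
    [-0.6, 0.1, 0.0, -0.2, 0.8],
]
biases = [0.0, 0.1]

combined = combine_latents(global_z, local_z)
snp_probabilities = decode_local(combined, weights, biases)
# Threshold each probability at 0.5 to obtain a hard allele call.
imputed_calls = [1 if p >= 0.5 else 0 for p in snp_probabilities]
```

In such a sketch the global vector would carry genome-wide context while the local vector would carry information for the chromosomal segment being imputed; a trained decoder would learn the weights rather than receive them as constants.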

In some aspects, the genotypic association data includes a collection of genotypic markers or single nucleotide polymorphisms (SNPs) from a plurality of genetically divergent populations. The subset of the discrete variables may be a plurality of SNPs localized to a segment of the chromosome. In some aspects, the encoder is based on a neural network algorithm. In some aspects, the imputed or predicted phenotype is predicted yield gain. In some aspects, the imputed or predicted phenotype is root lodging, stalk lodging, brittle snap, ear height, grain moisture, plant height, disease resistance, drought tolerance, or a combination thereof. In some aspects, the imputed or predicted genotype is a plurality of haplotypes. In some aspects, the local decoder imputes or predicts local high-density (HD) SNPs.

In some aspects, the genotypic association data is obtained from populations of plants derived from two or more breeding programs, where the breeding programs do not comprise an identical set of markers or SNPs corresponding to the genotypic association data. In some aspects, the local decoder imputes local HD SNPs of one population based on the decoding of genotypic association data of another population. In some aspects, the local decoder imputes haplotypes for one population based on the decoding of genotypic association data of another population. In some aspects, the local decoder imputes or predicts a molecular phenotype including but not limited to gene expression, chromatin accessibility, DNA methylation, histone modifications, recombination hotspots, genomic landing locations for transgenes, transcription factor binding status, or a combination thereof. In some aspects, the local decoder imputes or predicts population coancestry for one or more of the test populations.

Also provided herein in an embodiment is a universal method of parametrically representing genotypic or phenotypic association data from a training data set obtained from a population or a sample set to infer a characteristic of interest, e.g. a desirable characteristic, in test data obtained from a test population or a test sample data. In some aspects, the method includes generating a universal continuous global latent space representation by encoding discrete or continuous variables derived from genome-wide genotypic or phenome-wide phenotypic association training data into latent vectors through a machine learning-based global encoder framework, where the global latent space or global latent space representation is independent of the underlying genotypic or phenotypic association. In some examples, the global encoder is an autoencoder. In some examples, the autoencoder is a variational autoencoder. In some aspects, the machine-learning based encoder framework is a generative adversarial network (GAN). In some aspects, the machine-learning based encoder framework is a neural network. In some aspects, the method includes decoding the global latent representation by a global decoder, thereby inferring the desirable characteristic of the test data by the decoded global latent representation.

In some aspects, the characteristic of interest, e.g. a desirable characteristic, is without limitation coancestry determination of two or more populations of plants or predicting yield gain or an agronomic phenotype of interest. In some aspects, the encoder is based on a neural network algorithm.

Also provided herein is a universal method of developing universal representation of genotypic or phenotypic data that includes receiving by a first neural network one or more training genotypic or phenotypic data, where the first neural network includes a global encoder. In some aspects, the method includes encoding by the global encoder, the information from one or more training genotypic or phenotypic data into latent vectors through a machine-learning based neural network training framework. In some aspects, the method includes providing the encoded latent vectors (generated from other genotypic or phenotypic data) to a second machine-learning based neural network, where the second neural network includes a decoder. In some aspects, the method includes training the decoder to predict a genotype or phenotype of interest for the encoded latent vectors based on a pre-specified or learned objective function. In some aspects, the method includes decoding by the decoder the encoded latent vector for the objective function. In some aspects, the method includes providing an output for the objective function of the decoded latent vector.

Also provided herein is a method of selecting an attribute of interest based on genotypic or phenotypic data. In some aspects, the method includes receiving by a first neural network one or more training global genotypic or phenotypic data, where the first neural network includes a global encoder. In some examples, the global encoder is an autoencoder. In some examples, the autoencoder is a variational autoencoder. In some aspects, the machine-learning based neural network is a generative adversarial network (GAN).

In some aspects, the method includes encoding by the global encoder, genotypic or phenotypic information from one or more training genotypic or phenotypic data into latent vectors. In some aspects, the method includes training the global encoder using the latent vectors to learn underlying genotypic or phenotypic correlations and/or relatedness. In some aspects, the method includes receiving by a second neural network one or more training local genotypic or phenotypic data, where the local genotypic or phenotypic data is directed to a subset of global genotypic or phenotypic data that corresponds to a certain attribute of interest, where the second neural network includes a local encoder. In some examples, the local encoder is an autoencoder. In some examples, the autoencoder is a variational autoencoder. In some aspects, the method includes encoding by the local encoder, the genotypic or phenotypic information from the one or more training local genotypic or phenotypic data into latent vectors. In some aspects, the method includes training the local encoder using the latent vectors to learn underlying genotypic correlations and/or relatedness for the attribute of interest. In some aspects, the method includes providing the encoded latent vectors from the global encoder and/or local encoder to a third neural network, where the third neural network includes a decoder. In some aspects, the method includes training the decoder to predict the attribute of interest for the encoded latent vectors from the global encoder and/or the local encoder using a pre-specified or learned objective function. In some aspects, the method includes decoding by the decoder, the encoded latent vectors for the objective function. In some aspects, the method includes providing an output for the objective function of the decoded latent vector.

The decoder may include one or more decoders. In some aspects, the decoder is a local decoder. In some aspects, the decoder is a global decoder and decodes the encoded latent vectors from the global encoder. In some aspects, the global training genotypic data includes markers across the genome. In some aspects, the local genotypic data is from a specific chromosomal genomic region of interest or allele. In some aspects, the method includes training the global encoder and decoder simultaneously.

In some aspects, the local attribute may include without limitation SNPs, alleles, markers, quantitative trait loci (QTLs), gene expression, phenotypic variation, metabolite level, or combinations thereof. In some aspects, the encoder may be an autoencoder. In some aspects, the autoencoder is a variational autoencoder.

In some aspects, the training genotypic data includes without limitation SNPs or indels (INsertions/DELetions) sequence information. In some aspects, the training genotypic or phenotypic data includes sequence information from in silico crosses. In some aspects, the encoder weights are updated relative to a reconstruction error so that the training genotypic or phenotypic data information is separated within the latent space. In some aspects, the decoder is trained on existing genotypic or phenotypic data.
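
By way of non-limiting illustration only, updating encoder weights relative to a reconstruction error may be sketched in Python as follows. For brevity the sketch substitutes a plain linear autoencoder with a single scalar latent value for the variational autoencoder discussed herein, uses mean squared error as the reconstruction error, and applies full-batch gradient descent. The function names, the toy dosage data, the initial weights, and the learning rate are all hypothetical.

```python
def forward(sample, enc_w, dec_w):
    """Encode a sample to a scalar latent value, then decode it back."""
    latent = sum(w * x for w, x in zip(enc_w, sample))
    reconstruction = [w * latent for w in dec_w]
    return latent, reconstruction


def reconstruction_error(samples, enc_w, dec_w):
    """Mean squared reconstruction error over all samples and markers."""
    total = 0.0
    for sample in samples:
        _, recon = forward(sample, enc_w, dec_w)
        total += sum((r - x) ** 2 for r, x in zip(recon, sample))
    return total / (len(samples) * len(samples[0]))


def gradient_step(samples, enc_w, dec_w, learning_rate):
    """One full-batch gradient-descent update of encoder and decoder weights."""
    n_markers = len(enc_w)
    scale = 2.0 / (len(samples) * n_markers)
    enc_grad = [0.0] * n_markers
    dec_grad = [0.0] * n_markers
    for sample in samples:
        latent, recon = forward(sample, enc_w, dec_w)
        errors = [r - x for r, x in zip(recon, sample)]
        # Error signal reaching the latent value, back-propagated through
        # the decoder weights.
        latent_grad = sum(e * w for e, w in zip(errors, dec_w))
        for i in range(n_markers):
            dec_grad[i] += scale * errors[i] * latent
            enc_grad[i] += scale * latent_grad * sample[i]
    new_enc = [w - learning_rate * g for w, g in zip(enc_w, enc_grad)]
    new_dec = [w - learning_rate * g for w, g in zip(dec_w, dec_grad)]
    return new_enc, new_dec


# Hypothetical allele dosages (0, 1, or 2) for three lines at four markers.
samples = [[0, 2, 2, 0], [0, 2, 1, 0], [2, 0, 0, 2]]
enc_w = [0.1, 0.1, 0.1, 0.1]
dec_w = [0.1, 0.1, 0.1, 0.1]

initial_error = reconstruction_error(samples, enc_w, dec_w)
for _ in range(200):
    enc_w, dec_w = gradient_step(samples, enc_w, dec_w, learning_rate=0.05)
final_error = reconstruction_error(samples, enc_w, dec_w)
```

A single scalar latent value cannot reconstruct all three toy lines exactly, so the error falls without reaching zero; a practical model would use a higher-dimensional latent space and nonlinear layers.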

Also provided herein is a computer system for generating genotypic or phenotypic data determinations. In one embodiment, the system includes a first neural network that includes an encoder configured to encode genotypic or phenotypic information from one or more training genotypic or phenotypic data into universal latent vectors, where the encoder has been trained to represent genotypic or phenotypic associations through a machine-learning based neural network framework, and a second neural network that includes a decoder configured to decode the encoded latent vectors and generate an output for an objective function. In some aspects, the encoder may be an autoencoder. In some aspects, the autoencoder is a variational autoencoder.

Also provided herein in an embodiment is a universal method of parametrically representing genotypic or phenotypic data obtained from a population or a sample set to impute or predict a desired genotype and/or phenotype. In some aspects, the method includes generating a universal latent space representation by encoding discrete or continuous variables derived from genotypic or phenotypic data into latent vectors through a machine learning-based encoder framework, where the latent space or latent space representation is independent of the underlying genotypic or phenotypic data. In some aspects, the method includes decoding the latent representation by a decoder, thereby imputing or predicting the desired genotype or phenotype by the decoded latent representation.

In some aspects, the genotypic data is a collection of genotypic markers or single nucleotide polymorphisms (SNPs) from a plurality of genetically divergent populations. In some aspects, a subset of the discrete variables is a plurality of SNPs localized to a segment of a chromosome. In some aspects, the encoder is based on a neural network algorithm. In some aspects, the imputed or predicted phenotype is yield gain, root lodging, stalk lodging, brittle snap, ear height, grain moisture, plant height, disease resistance, drought tolerance, or a combination thereof.

In some aspects, the imputed or predicted genotype is a plurality of haplotypes.

In some aspects, the decoder imputes or predicts SNPs, such as local high-density (HD) SNPs, and/or indels.

In some aspects, genotypic data is obtained from populations of plants derived from two or more breeding programs, where the breeding programs do not have an identical set of markers or SNPs corresponding to the genotypic data. In some aspects, the decoder imputes or predicts local HD SNPs of one population based on the decoding of genotypic data of another population. In some aspects, the decoder imputes or predicts haplotypes for one population based on the decoding of genotypic data of another population.

In some aspects, the decoder imputes or predicts a molecular phenotype selected from gene expression, chromatin accessibility, DNA methylation, histone modifications, recombination hotspots, genomic landing locations for transgenes, transcription factor binding status, or a combination thereof. In some aspects, the decoder imputes or predicts population coancestry for one or more of the populations.

Also provided herein is a computer system for generating genotypic or phenotypic data determinations. In one embodiment, the system includes a first network that includes an encoder configured to encode genotypic or phenotypic information from one or more training genotypic or phenotypic data into universal latent vectors, where the encoder has been trained to represent genotypic or phenotypic associations through a machine-learning based network framework, and a second network that includes a decoder configured to decode the encoded latent vectors and generate an output for an objective function. In some aspects, the encoder may be an autoencoder. In some aspects, the autoencoder is a variational autoencoder. In some aspects, the machine-learning based network framework is a generative adversarial network (GAN). In some aspects, the machine-learning based network framework is a neural network.

Also provided herein is a computing device for training a neural network for translation between genotyping platforms. In one embodiment, the computing device includes a memory and one or more processors. The one or more processors are configured to obtain training data associated with at least two populations from the genotyping platforms; generate a first latent space representation by encoding variables derived from the training data into a first set of latent vectors using a first encoder machine learning network; generate a second latent representation by encoding a subset of the variables from the training data into a second set of latent vectors using a second encoder machine learning network; combine the global latent representation and the local latent representation to train a decoder machine learning network; and decode one or more latent vectors from the combined global and local latent representations to impute or predict a genotype or a phenotype of the training data corresponding to the one or more latent vectors using the decoder machine learning network.

In some embodiments, the training data may include genome-wide genotypic association training data and/or phenome-wide phenotypic association training data.

In some embodiments, the genome-wide genotypic association training data may include genotypic markers, indels, and/or single nucleotide polymorphisms (SNPs) from a plurality of genetically divergent populations.

In some embodiments, the subset of the variables may be a plurality of indels and/or single nucleotide polymorphisms (SNPs) localized to a segment of a chromosome.

In some embodiments, the genome-wide genotypic association training data may be obtained from populations of plants derived from two or more breeding programs. The breeding programs may not include an identical set of markers, indels, and/or single nucleotide polymorphisms (SNPs) corresponding to the genotypic association data.
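
By way of non-limiting illustration only, one possible way of presenting genotypes from breeding programs with non-identical marker sets to a shared encoder may be sketched in Python as follows. The sketch assumes that markers are aligned onto the union of all marker names and that an accompanying mask records which markers were actually observed. The function names (`build_marker_index`, `align_genotype`), the marker names, and the dosage values are hypothetical and are not taken from this disclosure.

```python
def build_marker_index(*marker_lists):
    """Return the sorted union of marker names mapped to column positions."""
    union = sorted(set().union(*marker_lists))
    return {name: position for position, name in enumerate(union)}


def align_genotype(calls, marker_index, missing_value=0.0):
    """Place one line's marker calls onto the shared marker axis.

    `calls` maps marker name -> allele dosage (0, 1, or 2). Returns the aligned
    dosage vector plus a 0/1 mask marking which markers were observed.
    """
    values = [missing_value] * len(marker_index)
    mask = [0] * len(marker_index)
    for name, dosage in calls.items():
        position = marker_index[name]
        values[position] = float(dosage)
        mask[position] = 1
    return values, mask


# Two hypothetical breeding programs that share only one marker.
program_a_markers = ["snp_001", "snp_002", "snp_004"]
program_b_markers = ["snp_002", "snp_003", "snp_005"]
marker_index = build_marker_index(program_a_markers, program_b_markers)

line_a = {"snp_001": 2, "snp_002": 0, "snp_004": 1}
line_b = {"snp_002": 2, "snp_003": 1, "snp_005": 0}

values_a, mask_a = align_genotype(line_a, marker_index)
values_b, mask_b = align_genotype(line_b, marker_index)
```

The mask would allow a reconstruction loss to ignore markers that a given platform never assayed, so that an unobserved marker is not confused with a homozygous reference call.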

In some embodiments, the first encoder machine learning network may include a global variational autoencoder framework.

In some embodiments, the second encoder machine learning network may include a local variational autoencoder framework.

In some embodiments, the first latent space representation may be independent of the underlying genotypic or phenotypic association.

In some embodiments, the imputed or predicted phenotype may be predicted yield gain.

In some embodiments, the imputed or predicted phenotype may be root lodging, stalk lodging, brittle snap, ear height, grain moisture, plant height, disease resistance, and/or drought tolerance.

In some embodiments, the imputed or predicted genotype may be a plurality of haplotypes.

In some embodiments, the imputed or predicted genotype may be local high-density (HD) SNPs.

In some embodiments, to decode the one or more latent vectors from the combined global and local latent representations may include to decode the one or more latent vectors from the combined global and local latent representations to impute or predict local high-density (HD) SNPs of a first population based on the decoding of genome-wide genotypic association training data of a second population.

In some embodiments, to decode the one or more latent vectors from the combined global and local latent representations may include to decode the one or more latent vectors from the combined global and local latent representations to impute or predict haplotypes for a first population based on the decoding of genotypic association data of a second population.

In some embodiments, the imputed or predicted phenotype may include gene expression, chromatin accessibility, DNA methylation, histone modifications, recombination hotspot, genomic landing locations for transgenes, and/or transcription factor binding status.

In some embodiments, to decode the one or more latent vectors from the combined global and local latent representations may include to decode the one or more latent vectors from the combined global and local latent representations to impute or predict population coancestry for one or more of the test populations of the training data.

Also provided herein is a system for training a neural network for translation between genotyping platforms. The system includes one or more servers and a computing device communicatively coupled to the one or more servers. Each of the one or more servers stores training data associated with one or more populations. The computing device further includes a memory and one or more processors. The one or more processors are configured to obtain training data; generate a first latent space representation by encoding variables derived from the training data into a first set of latent vectors using a first encoder machine learning network; generate a second latent representation by encoding a subset of the variables from the training data into a second set of latent vectors using a second encoder machine learning network; combine the global latent representation and the local latent representation to train a decoder machine learning network; and decode one or more latent vectors from the combined global and local latent representations to impute or predict a genotype or a phenotype of the training data corresponding to the one or more latent vectors using the decoder machine learning network.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention can be more fully understood from the following detailed description and the accompanying drawings which form a part of this application.

FIG. 1 is a block diagram illustrating an exemplary computer system including a server and a computing device according to an embodiment as disclosed herein;

FIG. 2 is a schematic that illustrates the use of marker information from two different platforms to impute markers, haplotypes, or other information, e.g. population genetics, genomic prediction, based on latent representations of the underlying marker information;

FIG. 3 is a schematic illustrating the steps in one embodiment of a method of imputing the haplotypes onto germplasm based on latent representations of the underlying SNP information;

FIG. 4 is a flowchart showing one example of imputing separate marker populations onto germplasm, where the historical relationships of the germplasm are unknown, based on latent representations of the underlying marker information, and using the resulting imputed information to facilitate molecular breeding applications, haplotype framework generation, and/or diversity characterization that is independent of the genotyping platform;

FIG. 5A and FIG. 5B are schematics illustrating the steps in one embodiment of a method of imputing combined production markers from two different groups, Group A and Group B. Steps 1 and 2 are shown in FIG. 5A and Step 3 is shown in FIG. 5B;

FIG. 6 is a schematic of an example showing the potential applications that can use the imputed information which is based on common latent representations of the underlying marker information, such as genetic elements, from multiple marker platforms;

FIG. 7 is a schematic of one example of one method of predicting coancestry between genotypes;

FIG. 8 is a schematic of an example showing that imputed information based on common latent representations of the underlying marker information from multiple marker platforms can be used in clustering, selection inferences, Fstats, and historical demographics;

FIG. 9A is an exemplary graph illustrating how the universal translation of the underlying disjoint marker information may lead to robust, genetically-meaningful representations. FIG. 9A shows a reduced-dimensionality visualization of the global latent space of two populations (i.e., Population 1 and Population 2) with disjoint marker sets. Despite disjoint inputs, the latent representations of germplasm originating from Population 2 and genotyped on the Population 1 marker platform cluster with Population 1's genotyped versions of those inbred lines. FIG. 9B and FIG. 9C show Euclidean distances of latent representations (FIG. 9B) and Pearson correlations of the latent representations (FIG. 9C).

FIG. 10A illustrates how latent representations may be used to predict coancestry of individuals within and between various populations; FIG. 10B shows that the latent representations may also be used to predict whole-organism phenotypes, as shown here for YIELD within wheat.

FIG. 11 illustrates embodiments of how haplotype information, which can be imputed based on the universal latent space, may be leveraged for pooling of statistical power in molecular function studies based on replication at the level of the haplotype;

FIGS. 12A-12C illustrate how leveraging of the haplotype information through latent representations results in increased statistical power to detect accessible chromatin based on an ATAC-seq assay. FIG. 12A illustrates the accuracy and power of the haplotype-pooling approach. FIG. 12B and FIG. 12C illustrate examples of detected peaks using haplotype pooling. Grey lines correspond to tissue peaks that were only detected using haplotype pooling. FIG. 12B illustrates the detection of peaks at alternative TSSs of a single gene, while FIG. 12C illustrates the detection of peaks at a known major QTL in maize that is 65 kb from the nearest protein-coding gene.

FIGS. 13-20 are example inputs and outputs of encoders and decoders.
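
By way of non-limiting illustration only, the two similarity measures referred to in connection with FIG. 9B and FIG. 9C, namely Euclidean distance and Pearson correlation between latent representations, may be computed as sketched in Python below. The function names and the three-dimensional latent vectors are hypothetical and are not data from this disclosure.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two latent vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("latent vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two latent vectors."""
    if len(a) != len(b):
        raise ValueError("latent vectors must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_a * var_b)


# Hypothetical latent vectors: the same inbred line genotyped on two marker
# platforms, and an unrelated line.
line_platform_1 = [0.9, -0.2, 1.1]
line_platform_2 = [1.0, -0.1, 1.0]
unrelated_line = [-1.2, 0.8, -0.5]

same_line_distance = euclidean_distance(line_platform_1, line_platform_2)
unrelated_distance = euclidean_distance(line_platform_1, unrelated_line)
same_line_correlation = pearson_correlation(line_platform_1, line_platform_2)
```

If the latent space is insensitive to the marker platform, the same line encoded from two platforms would be expected to show a small distance and a high correlation relative to unrelated lines, as in this toy example.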

DETAILED DESCRIPTION

It is to be understood that this invention is not limited to particular embodiments, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. Further, all publications referred to herein are each incorporated by reference for the purpose cited to the same extent as if each was specifically and individually indicated to be incorporated by reference herein.

Methods and systems provided herein minimize the labor intensive steps normally associated with machine learning applications, such as, for example, the construction of a feature set that is relevant for the scope of the problem, satisfaction of the constraints of the algorithm(s) to be used, and minimal prediction error on testing data.

Referring to FIG. 1, a block diagram of a computer system 100 for parametrically representing genotypic or phenotypic association data is shown. To do so, the system 100 may include a computing device 110 and a server 130 that is associated with a computer system. The system 100 may further include one or more servers 140 that are associated with other computer systems such that the computing device 110 may communicate with different computer systems running different platforms. However, it should be appreciated that, in some embodiments, a single server (e.g., a server 130) may run multiple platforms. The computing device 110 is communicatively coupled to the one or more servers 130, 140 via a network 150 (e.g., a local area network (LAN), a wide area network (WAN), a personal area network (PAN), the Internet, etc.).

In use, the computing device 110 may predict genotype and/or phenotype associations by training a neural network for universal translation between genotyping platforms. More specifically, the computing device 110 may obtain data from multiple or potentially disjoint platforms and translate the data into a universal, platform independent (e.g., marker-independent), latent space. For example, in the context of genomic characterization, a smooth spatial organization of the latent space captures varying levels of ancestral relationships that are present within a dataset. Genomic variation within a population, such as a plant breeding program, may be characterized by a variety of methods. For example, genotypes are characterized with a common platform that interrogates localized variants such as single nucleotide polymorphisms (SNPs) and/or insertions/deletions (indels). Due to the ancestral recombination and demographic history of the population, these variants tend to co-segregate within linked segments (haplotypes). Further, single genotypes may then be further characterized by the set of haplotypes they contain. As described further below, variational autoencoders (VAEs) may be used to compress the information contained within a given set of production markers to a common, marker-invariant, latent space capable of capturing these co-segregation patterns genome-wide.
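
By way of non-limiting illustration only, the conversion of SNP and indel genotype calls into discrete numeric variables suitable as encoder inputs may be sketched in Python as follows. The sketch assumes diploid calls written with a slash or pipe separator and encodes each call as the number of copies of an alternate allele. The function names, the call format, and the example alleles are hypothetical and are not taken from this disclosure.

```python
def alt_allele_dosage(call, alt_allele, missing_token="."):
    """Count copies of the alternate allele in a diploid call such as "A/G".

    Returns None when any allele in the call is missing.
    """
    alleles = call.replace("|", "/").split("/")
    if missing_token in alleles:
        return None
    return sum(1 for allele in alleles if allele == alt_allele)


def encode_line(calls, alt_alleles):
    """Encode one line's calls, marker by marker, as alternate-allele dosages."""
    return [alt_allele_dosage(c, alt) for c, alt in zip(calls, alt_alleles)]


# Hypothetical calls at five markers; the last marker is an insertion.
calls = ["A/A", "A/G", "G/G", "./.", "T/TCA"]
alt_alleles = ["G", "G", "G", "C", "TCA"]
encoded = encode_line(calls, alt_alleles)
```

Missing calls are kept as None here rather than being replaced with a number, leaving the choice of imputation or masking to a later step.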

In general, the computing device 110 may include any existing or future devices capable of training a neural network. For example, the computing device may be, but is not limited to, a computer, a notebook, a laptop, a mobile device, a smartphone, a tablet, a wearable, smart glasses, or any other suitable computing device that is capable of communicating with the server 130.

The computing device 110 includes a processor 112, a memory 114, an input/output (I/O) controller 116 (e.g., a network transceiver), a memory unit 118, and a database 120, all of which may be interconnected via one or more address/data buses. It should be appreciated that although only one processor 112 is shown, the computing device 110 may include multiple processors. Although the I/O controller 116 is shown as a single block, it should be appreciated that the I/O controller 116 may include a number of different types of I/O components (e.g., a display, a user interface (e.g., a display screen, a touchscreen, a keyboard), a speaker, and a microphone).

The processor 112 as disclosed herein may be any electronic device that is capable of processing data, for example a central processing unit (CPU), a graphics processing unit (GPU), a system on a chip (SoC), or any other suitable type of processor. It should be appreciated that the various operations of example methods described herein (i.e., performed by the computing device 110) may be performed by one or more processors 112. The memory 114 may be a random-access memory (RAM), read-only memory (ROM), a flash memory, or any other suitable type of memory that enables storage of data such as instruction codes that the processor 112 needs to access in order to implement any method as disclosed herein. It should be appreciated that, in some embodiments, the computing device 110 may be a computing device or a plurality of computing devices with distributed processing.

As used herein, the term “database” may refer to a single database or other structured data storage, or to a collection of two or more different databases or structured data storage components. In the illustrative embodiment, the database 120 is part of the computing device 110. In some embodiments, the computing device 110 may access the database 120 via a network such as network 150. The database 120 may store data (e.g., input, output, intermediary data) that is necessary to generate a universal continuous latent space representation. For example, the data may include genotypic data, such as single nucleotide polymorphisms (SNPs), genetic markers, haplotype, sequence information, and/or phenotype data that are obtained from one or more servers 130, 140.

The computing device 110 may further include a number of software applications stored in a memory unit 118, which may be called a program memory. The various software applications on the computing device 110 may include specific programs, routines, or scripts for performing processing functions associated with the methods described herein. Additionally or alternatively, the various software applications on the computing device 110 may include general-purpose software applications for data processing, database management, data analysis, network communication, web server operation, or other functions described herein or typically performed by a server. The various software applications may be executed on the same computer processor or on different computer processors. Additionally, or alternatively, the software applications may interact with various hardware modules that may be installed within or connected to the computing device 110. Such modules may implement part of or all of the various exemplary method functions discussed herein or other related embodiments.

Although only one computing device 110 is shown in FIG. 1, the server 130, 140 is capable of communicating with multiple computing devices similar to the computing device 110. Although not shown in FIG. 1, similar to the computing device 110, the server 130, 140 also includes a processor (e.g., a microprocessor, a microcontroller), a memory, and an input/output (I/O) controller (e.g., a network transceiver). The server 130, 140 may be a single server or a plurality of servers with distributed processing. The server 130, 140 may receive data from and/or transmit data to the computing device 110.

The network 150 is any suitable type of computer network that functionally couples at least one computing device 110 with the server 130, 140. The network 150 may include a proprietary network, a secure public internet, a virtual private network and/or one or more other types of networks, such as dedicated access lines, plain ordinary telephone lines, satellite links, cellular data networks, or combinations thereof. In embodiments where the network 150 comprises the Internet, data communications may take place over the network 150 via an Internet communication protocol.

Referring now to FIG. 2, a schematic diagram illustrating a use of marker information from multiple platforms to construct a universal latent representation of genotypes that is insensitive to an input marker platform is shown. As described further below, the universal latent representations may be used for various downstream analyses such as marker imputation, haplotype imputation, genomic prediction, or population genetic inference. To do so, various genotype/phenotype applications may involve using variational autoencoders (VAEs). One such example is universal translation between genotyping platforms. VAEs are hybrids of deep neural networks and probabilistic graphical models that enable construction of a compressed latent representation that is independent of the underlying data generation (e.g., genotyping platform) and serves as a basis for imputing characteristics of a desired data set (e.g., multiple germplasm characterization). Because time spent on custom tailoring for machine-learning applications often produces an application of limited scope, the use of deep learning approaches reduces the labor and broadens the application of machine learning by automating the construction of optimal feature spaces based on raw inputs, which were utilized to build a variety of VAEs described herein.

The core of a VAE is rooted in Bayesian inference, which includes modeling of the underlying probability distribution of data, such that new data can be sampled from that distribution, which is independent of the dataset that resulted in the probability distribution. VAEs have a property that separates them from standard autoencoders and makes them suitable for generative modeling: the latent spaces that VAEs generate are, by nature of the framework, probability distributions, thereby allowing simpler random sampling and interpolation for desirable end-uses. VAEs accomplish this latent space representation by having the encoder output, rather than a single encoding vector of size n, two vectors of size n: a vector of means, μ, and another vector of standard deviations, σ. Some of the basic notions for VAE include, for example:

-   X: data that needs to be modeled, for example, genotypic data (such as SNPs, markers, haplotype, sequence information)
-   z: latent variable
-   P(X): probability distribution of the data, for example, genotypic data
-   P(z): probability distribution of latent variable (e.g., genotypic associations from the underlying genotypic data)
-   P(X|z): distribution of generating data given latent variable, e.g., prediction or imputation of the desired outcome based on the latent variable.
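The two-vector encoder output described above (a vector of means μ and a vector of standard deviations σ) can be illustrated with a minimal sketch of how a latent vector z might be drawn from it. The function name `sample_latent`, the example values, and the use of the reparameterized draw z = μ + σ·ε with ε ~ N(0, 1) are assumptions made for illustration; the disclosure does not specify a sampling routine.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Draw one latent vector z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and sigma are the two size-n vectors emitted by the encoder.
    This reparameterized draw is an illustrative assumption.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    eps = rng.standard_normal(mu.shape)  # unit Gaussian noise per dimension
    return mu + sigma * eps

# Hypothetical encoder output for a 3-dimensional latent space.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.1, 0.2, 0.0])  # zero spread: third dimension is deterministic
z = sample_latent(mu, sigma, rng)
```

Because each call draws fresh noise, repeated calls with the same μ and σ give nearby but distinct latent vectors, which is what allows sampling and interpolation in the latent space.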

VAE is based on the principle that if there exists a hidden variable z, which generates an observation or an outcome x, then one of the objectives is to model the data, i.e., to find P(X). However, one can observe x, but the characteristics of z need to be inferred. Thus, p(z|x) needs to be computed.

p(z|x)=p(x|z)p(z)/p(x)

However, computing p(x) requires marginalizing over the latent variable z. This function can be expressed as follows:

p(x)=∫p(x|z)p(z)dz

Because p(x) is an intractable distribution, variational inference is used to optimize the joint distribution of x and z. The function p(z|x) is approximated by another distribution q(z|x), which is defined such that it is a tractable distribution. The parameters of q(z|x) are chosen such that q(z|x) is highly similar to p(z|x), and therefore it can be used to perform approximate inference of the intractable distribution. KL divergence is a measure of difference between two probability distributions. Therefore, the goal is to minimize the KL divergence between the two distributions, and this minimization is expressed as:

min KL(q(z|x)∥p(z|x))

This expression is minimized by maximizing the following:

E_{q(z|x)}[log p(x|z)] − KL(q(z|x)∥p(z))
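The link between the minimization and the maximization stated above can be made explicit with a short worked step. The following is a sketch of the standard identity, written with the same symbols q(z|x), p(x|z), p(z), and p(x) used above and obtained by substituting Bayes' rule for p(z|x):

```latex
\begin{aligned}
\mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big)
 &= \mathbb{E}_{q(z \mid x)}\big[\log q(z \mid x) - \log p(z \mid x)\big] \\
 &= \mathbb{E}_{q(z \mid x)}\big[\log q(z \mid x) - \log p(x \mid z) - \log p(z)\big] + \log p(x) \\
 &= \log p(x) - \Big(\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
    - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\Big)
\end{aligned}
```

Because log p(x) does not depend on q, maximizing the bracketed quantity, which is the two-term expression given above, minimizes the KL divergence on the left-hand side.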

In the expression above, the reconstruction likelihood is represented by the first term, and the second term penalizes departure of probability mass in q from the prior distribution, p. q is used to infer hidden variables (latent representation), and this is built into a neural network architecture where the encoder model learns the mapping relation from x to z and the decoder model learns the mapping from z back to x. Therefore, the loss function for this neural network includes two terms: one that penalizes reconstruction error or maximizes the reconstruction likelihood, and the other that encourages the learned distribution q(z|x) to be highly similar to the true prior distribution p(z), which is assumed to follow a unit Gaussian distribution, for each dimension j of the latent space. This is represented by:

$\mathcal{L}(x,\hat{x}) + \sum_{j} \mathrm{KL}\left( q_{j}(z \mid x) \,\|\, p(z) \right)$
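As a minimal sketch of how the two-term loss above might be evaluated, the following assumes that each q_j(z|x) is a univariate Gaussian with mean μ_j and standard deviation σ_j and that p(z) is the unit Gaussian, in which case the KL term has the closed form 0.5·(μ_j² + σ_j² − 1 − log σ_j²). The squared-error reconstruction term and the function names are assumptions for illustration; the disclosure does not fix a particular reconstruction loss.

```python
import numpy as np

def kl_to_unit_gaussian(mu, sigma):
    """Per-dimension KL( N(mu_j, sigma_j^2) || N(0, 1) ), closed form."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * (mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction error plus the KL penalty summed over latent dimensions j.

    Squared error is an assumed stand-in for the reconstruction term L(x, x_hat).
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    reconstruction = np.sum((x - x_hat) ** 2)
    return reconstruction + np.sum(kl_to_unit_gaussian(mu, sigma))

# Hypothetical marker vector, its reconstruction, and encoder outputs.
x = np.array([1.0, -1.0, 0.8])
x_hat = np.array([0.9, -0.7, 0.6])
loss = vae_loss(x, x_hat, mu=[0.3, -0.2], sigma=[0.9, 1.1])
```

The KL term is zero only when every latent dimension already matches the unit Gaussian prior, so any departure of q from the prior increases the loss.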

It should be appreciated that the variational autoencoder is one of several techniques that may be used for producing compressed latent representations of raw samples, for example, genotypic association data. Like other autoencoders, the variational autoencoder places a reduced dimensionality bottleneck layer between an encoder and a decoder neural network. Optimizing the neural network weights relative to the reconstruction error then produces separation of the samples within the latent space. However, unlike generative adversarial networks (GAN), the encoder neural network's outputs are parameterized univariate Gaussian distributions with standard N(0,1) priors. Thus, unlike other autoencoders, which tend to memorize inputs and place them in arbitrarily small locations within the latent space, the variational autoencoder produces a smooth, continuous latent space in which semantically-similar samples tend to be geometrically close, e.g., haplotypes that co-segregate to provide a certain phenotype.

For example, in the context of genomic characterization, a smooth spatial organization of the latent space captures varying levels of ancestral relationships that are present within a dataset. Genomic variation within a population such as a plant breeding program may be characterized by a variety of methods. For example, genotypes are characterized with a common platform that interrogates localized variants such as single nucleotide polymorphisms (SNPs) and/or insertions/deletions (indels). Due to the ancestral recombination and demographic history of the population, these variants tend to co-segregate within linked segments (haplotypes). Further, single genotypes may then be further characterized by the set of haplotypes they contain. For example, as described further below, VAEs may be used to compress the information contained within a given set of production markers to a common, marker-invariant, latent space capable of capturing these co-segregation patterns genome-wide.

In an embodiment that characterizes genotypic associations, certain features of VAE may be divided into two sources: first, large linked regions associated with recent family structure and second, highly localized statistical associations, i.e., linkage disequilibrium (LD), associated with ancient ancestry. To do so, as illustrated in FIG. 3, the deep neural networks, including a global encoder network, a local encoder network, and a local decoder network, are structured around these features by training in two stages.

First, a VAE may be trained with inputs from across a genome. The inputs may include production markers. The outputs that determine the reconstruction error may also be taken from across the genome; they may constitute a different set from the input markers. The resultant latent space from the global encoder is geometrically configured to approximate recent kinship and longer-distance ancestral relationships among the germplasm. For example, as illustrated in FIG. 3, a global encoder is trained to represent genetic marker co-segregation and pedigree relationships based on a full set of input SNPs, and this is encoded within the global latent representation.

Second, local encoder and decoder neural networks may then be trained for each smaller subsection of the genome. The local encoder network provides a high resolution representation of the LD within a local genomic region. One such input to a local encoder, for example, is a subset of the production SNPs localized to encompass the region of interest (e.g., a chromosome or a particular QTL). Once the local encoder is trained, the local decoder network may be trained to impute haplotypes within a defined genomic bin of that local region. The input to the local decoder is the combination of latent outputs from the local encoder and the (now frozen) global encoder, as shown in FIG. 3. The reconstruction objective for the local encoder/decoder combination, for example, is a set of markers within a small contiguous region (e.g., 1 centimorgan (cM) on a genetic map), which encourages the local latent representation to capture the highly localized linkage disequilibrium (LD) that may have been overlooked by the global encoder. It should be appreciated that, in some embodiments, the contiguous region may be defined in physical coordinates. Once constructed, the combination of the global latent space and the local latent space within a region provides a compressed representation of available information necessary for haplotype reconstruction and, by extension, any inference method conditioned upon genotypic data.
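The way the local decoder consumes both latent representations can be sketched as follows. This is a minimal illustration that assumes the "combination" is a simple concatenation of the global and local latent vectors and that the decoder is a single linear layer with a tanh output bounded in [−1, 1]; the function names, dimensions, and random weights are hypothetical, and a practical decoder would be a trained deep network.

```python
import numpy as np

def local_decoder_input(z_global, z_local):
    """Combine the (frozen) global latent and the local latent by concatenation."""
    return np.concatenate([z_global, z_local])

def toy_local_decoder(z_combined, weights, bias):
    """One linear layer with tanh, giving one value in [-1, 1] per marker in the bin."""
    return np.tanh(weights @ z_combined + bias)

rng = np.random.default_rng(0)
z_global = rng.standard_normal(8)   # hypothetical global latent sample
z_local = rng.standard_normal(4)    # hypothetical local latent sample for one region
n_bin_markers = 6                   # markers inside one genomic bin (assumed)

# Untrained, randomly initialised weights, used only to show the data flow.
weights = 0.1 * rng.standard_normal((n_bin_markers, z_global.size + z_local.size))
bias = np.zeros(n_bin_markers)

z_combined = local_decoder_input(z_global, z_local)
reconstruction = toy_local_decoder(z_combined, weights, bias)
```

Keeping the global latent as a fixed input means that only the local encoder and local decoder parameters would change while a given region is being trained.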

It should be appreciated that in some embodiments, for example, as shown in FIGS. 4 and 5, the encoder inputs to the global encoder and the local encoder may include production markers from multiple or potentially disjoint platforms for imputing a unified set of markers onto separate populations of germplasm. As shown in FIG. 4, two populations may have unknown historical populations and/or few or no shared markers between their legacy marker platforms. The imputation process described in FIGS. 5A and 5B, which is conditioned on the latent representations of the underlying marker information, produces a unified view of markers across the legacy platforms in both populations. This unified marker set then enables molecular breeding applications, haplotype framework generation, and/or diversity characterization that is independent of the original genotyping platform.

The imputation process shown in FIGS. 5A and 5B is similar to the one described in FIG. 3. However, the imputation process of FIGS. 5A and 5B is different in that combined latent representations may be produced by inputting combined production markers from two different groups or populations of germplasm, Group A and Group B. Although two groups are shown in FIGS. 5A and 5B, it should be appreciated that production markers from more than two groups may be used as inputs to produce combined latent representations. Step 1 of FIG. 5A illustrates the construction of a global latent representation, which represents marker co-segregation and pedigree relationships independently of the group of origin due to the need to reconstruct a common set of high-density SNPs between the groups. Step 2 of FIG. 5A illustrates the training of local encoder networks that provide a latent representation of the local LD within each region, after accounting for the global relationships. The combined latent representations then allow for imputation of a unified set of production SNPs through local decoder networks, illustrated in step 3 of FIG. 5B.

Referring now to FIGS. 13-15, examples of input to the global and local encoders and output from the local decoder are shown. In the illustrative embodiment, the global encoder is trained with input that is coded as homozygous, heterozygous, or missing for a particular allele. For example, as shown in FIG. 13, a numeric value(s) is assigned to each marker indicating whether the allele is homozygous, heterozygous, or missing. In the illustrative embodiment, there are M markers over the entire genome, and each marker is a choice between bases (adenine (A), guanine (G), cytosine (C), and thymine (T)) or between insertions and deletions (I, D). Each marker has a choice between a first base and a second base at a specific allele. If an example genotype (i.e., a sample) has the homozygous first base, then that marker is assigned a numeric value 1. If, however, the example genotype has the homozygous second base, then that marker is assigned a numeric value −1. It should be appreciated that, in the illustrative embodiment, the markers are probabilistic calls rather than hard calls, as indicated by Marker M-1. For example, for Marker M-1, based on genotypes of the sample's parents, it may be predicted that the sample is likely to have the homozygous first base A with a probability of 0.9 and the homozygous second base C with a probability of 0.1. As such, in the illustrative embodiment, an example input for that marker is calculated by (0.9×1)+(0.1×−1)=0.8.

In the illustrative embodiment, Channel 2 is also generated to indicate whether the marker is homozygous (0), heterozygous (1), or missing (−1). However, it should be appreciated that, although two channels are shown in FIG. 13, only one channel may be used as input to one or more encoders. It should also be appreciated that any number, value, or code may be assigned in order to distinguish these features to generate formatted input to one or more encoders.
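The coding described for FIG. 13 can be sketched in a few lines. The first function reproduces the probability-weighted value for a marker (1 for the homozygous first base, −1 for the homozygous second base), and the lookup table reproduces the Channel 2 codes. The function and variable names are hypothetical, and the sketch covers only the cases spelled out in the text.

```python
def channel_1_value(p_first, p_second):
    """Probability-weighted marker value: +1 for homozygous first base,
    -1 for homozygous second base."""
    return p_first * 1.0 + p_second * -1.0

# Channel 2 codes as stated in the text: homozygous 0, heterozygous 1, missing -1.
CHANNEL_2_CODES = {"homozygous": 0, "heterozygous": 1, "missing": -1}

def channel_2_value(status):
    """Look up the Channel 2 code for a marker's zygosity/missingness status."""
    return CHANNEL_2_CODES[status]

# Marker M-1 from the example: first base with probability 0.9, second with 0.1.
marker_m1 = channel_1_value(0.9, 0.1)
print(round(marker_m1, 6))            # 0.8
print(channel_2_value("homozygous"))  # 0
```

A hard call is the special case in which one of the two probabilities equals 1, which gives exactly 1 or −1.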

As shown in FIG. 14, the encoding of markers across the genome is then used to train the global encoder to produce a representation of a latent distribution. The global decoder then takes a sample from the latent distribution as an input and reconstructs the original marker set (M markers). For example, the value 0.99 in the first column of the Example Global Output indicates that there is a high probability of a presence of a first allele (i.e., homozygous base C in this example as indicated in FIG. 13) at the locus that corresponds to Marker 1. By contrast, the value −0.95 in the third-to-last column of the Example Global Output indicates that there is a high probability of a second allele (i.e., deletion of a base in this example as indicated in FIG. 13) at the locus that corresponds to Marker M-2. The value −0.3 in the second column indicates that there is an uncertain probability of a second allele (i.e., homozygous base G in this example as indicated in FIG. 13) at the locus that corresponds to Marker 2. It should be appreciated that, in the illustrative embodiment, the parameters of the global encoder are held constant during the training.

Subsequently, as shown in FIG. 15, the local encoder receives input from a subset of the M markers that are located within a contiguous genomic region (i.e., Chromosome C in this example) and then produces a latent representation encoding local information after accounting for the global latent representation. The local decoder receives the global and local latent representation samples as an input and provides a reconstruction for markers within a given genomic window. To interpret the output of the local decoder, a different threshold may be predefined based on a desired level of accuracy. It should be noted that there is a trade-off between accuracy and missingness within the imputed values. For example, by increasing the level of accuracy, a certain marker may be set to missing due to insufficient confidence. For example, a predefined threshold may be set to 0.75. In other words, if the output value for a marker is greater than or equal to 0.75, that marker is denoted to have sufficient confidence for an allele call of 1. If, however, the absolute value of the output is less than 0.75, then that marker does not have sufficient confidence for imputation and is set to be missing from that specific genomic region. As such, in the illustrative embodiment, the resulting output markers on Chromosome C are translated to “C G T G . . . T D A I.”
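The thresholding of local decoder output described above can be sketched as follows. The text states the rule for values of at least 0.75 (call of 1) and for values whose absolute value is below 0.75 (missing); the symmetric call of −1 for values at or below −0.75 is an assumption inferred from the use of the absolute value. The function name, the use of NaN to mark missing calls, and the example values are hypothetical.

```python
import numpy as np

def call_markers(outputs, threshold=0.75):
    """Convert decoder outputs in [-1, 1] to allele calls.

    Returns 1.0 (first allele), -1.0 (second allele, assumed symmetric rule),
    or NaN (missing) when |output| < threshold.
    """
    outputs = np.asarray(outputs, dtype=float)
    calls = np.full(outputs.shape, np.nan)  # start with every marker missing
    calls[outputs >= threshold] = 1.0
    calls[outputs <= -threshold] = -1.0
    return calls

decoder_output = np.array([0.99, -0.3, -0.95, 0.75])  # illustrative values
calls_default = call_markers(decoder_output)           # threshold 0.75
calls_strict = call_markers(decoder_output, threshold=0.97)

missing_default = int(np.isnan(calls_default).sum())
missing_strict = int(np.isnan(calls_strict).sum())
```

Raising the threshold in this sketch leaves more markers uncalled, which reflects the stated trade-off between accuracy and missingness.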

It should be noted that FIGS. 16, 17, and 20, which are described further below, utilize the Example Input as input to the global and/or local encoder. Although only one local encoder and one local decoder are shown in FIG. 15, it should be appreciated that, in some embodiments, a system may include multiple local encoders and corresponding local decoders for different genomic regions. Each local encoder and decoder are trained to produce and translate a latent representation within a specified genomic region.

The global and local variational autoencoder framework described provides a general method for translation into a universal, platform independent (e.g., marker-independent), latent space. The details of the network structure and the training approach are readily adapted or adjusted to suit any particular application. For instance, convolutional neural networks are used for encoders and/or decoders in order to enforce known spatial structure on hidden layer representations. Generally, optimal performance in testing datasets requires data augmentation, with the augmentation mechanism conditioned upon biological mechanisms and the structure of the populations of interest.

Observed genotypes are supplemented with plausible in silico, predictive crosses to expand the initial finite training set to an effectively infinite training set capable of representing the full diversity of potential haplotype combinations. Input markers can also be masked randomly with missingness patterns observed in the initial dataset. The biological cross augmentation mechanism allows both encoder and decoder neural networks to extrapolate beyond the initial sequenced material to any likely combination of haplotypes, while the augmentation with missing data ensures well-calibrated uncertainty measures within both the latent space and the data reconstructions.
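The two augmentation mechanisms described above can be sketched in simplified form. The cross model here is a deliberately crude assumption: both parents are treated as fully homozygous lines coded 1 or −1, and the progeny is a mosaic that switches between parents along the marker order with a fixed per-marker probability. The masking step copies the missingness pattern of a randomly chosen observed sample. All names and parameter values are hypothetical, and a realistic simulator would use a genetic map.

```python
import numpy as np

def in_silico_progeny(parent_a, parent_b, switch_prob, rng):
    """Mosaic of two homozygous parents; switches parent with probability
    switch_prob between consecutive markers (simplified recombination)."""
    n = len(parent_a)
    switches = rng.random(n) < switch_prob
    switches[0] = rng.random() < 0.5             # random starting parent
    from_b = np.cumsum(switches) % 2 == 1        # odd number of switches -> parent B
    return np.where(from_b, parent_b, parent_a)

def mask_like_observed(sample, observed_missing, rng):
    """Apply the missingness pattern of one randomly chosen observed sample."""
    pattern = observed_missing[rng.integers(len(observed_missing))]
    masked = np.asarray(sample, dtype=float).copy()
    masked[pattern] = np.nan                     # NaN marks a masked marker
    return masked

rng = np.random.default_rng(1)
parent_a = np.ones(10)                           # homozygous first allele everywhere
parent_b = -np.ones(10)                          # homozygous second allele everywhere
child = in_silico_progeny(parent_a, parent_b, switch_prob=0.2, rng=rng)

# One observed missingness pattern (True = marker was missing in that sample).
observed_missing = np.array([[False, True, False, False, True,
                              False, False, False, True, False]])
augmented = mask_like_observed(child, observed_missing, rng)
```

Because a fresh progeny and a fresh mask can be drawn at every training step, the finite set of observed genotypes yields an effectively unbounded stream of training inputs.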

Referring now to FIG. 6, potential genomic prediction applications based on the latent representations are shown. A unified set of legacy markers can be imputed and then used directly for whole genome prediction based on linear combinations of markers in a legacy comprehensive map. Alternatively, a decoder neural network may be trained to directly translate latent representation to phenotypes of interest. It should be noted that some examples of the potential genomic prediction applications are further described in Examples 1-3 below.
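Whole genome prediction "based on linear combinations of markers" can be illustrated with a minimal sketch in which a phenotype is predicted as a weighted sum of imputed marker values. Ridge regression is used here to estimate the marker effects; that choice, the penalty value, and the synthetic data are assumptions for illustration only and are not taken from the disclosure.

```python
import numpy as np

def fit_marker_effects(markers, phenotypes, ridge_penalty):
    """Ridge estimate of marker effects: solve (X'X + lambda*I) beta = X'y."""
    n_markers = markers.shape[1]
    lhs = markers.T @ markers + ridge_penalty * np.eye(n_markers)
    rhs = markers.T @ phenotypes
    return np.linalg.solve(lhs, rhs)

def predict_phenotype(markers, effects):
    """Predicted phenotype as a linear combination of marker values."""
    return markers @ effects

# Synthetic example: 50 lines, 5 imputed markers coded 1 / -1.
rng = np.random.default_rng(0)
markers = rng.choice([-1.0, 1.0], size=(50, 5))
true_effects = np.array([0.5, -0.2, 0.0, 1.0, 0.3])   # made-up effects
phenotypes = markers @ true_effects                    # noise-free for clarity

effects = fit_marker_effects(markers, phenotypes, ridge_penalty=1e-8)
predicted = predict_phenotype(markers, effects)
```

In practice the number of markers usually exceeds the number of lines, and a larger penalty would be chosen, for example by cross-validation.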

FIG. 7 is an exemplary method of predicting coancestry between genotypes. Latent representations from two genotypes are given to a neural network, which then estimates the coancestry between them. The two genotypes may originate from the same or different populations, and the marker sets may or may not be disjoint. It should be noted that imputing coancestry is further described in Example 6 below.

FIG. 8 illustrates that imputed information based on common latent representations of the underlying marker information from multiple marker platforms may be used in clustering, selection inferences, population genetics summaries such as F-statistics, and/or historical demographics.

Referring now to FIG. 9, exemplary graphs illustrating how the universal translation of the underlying disjoint marker information may lead to robust, genetically-meaningful representations are shown. Graph A shows a reduced-dimensionality visualization of the global latent space of two populations (i.e., Population 1 and Population 2) with disjoint marker sets. Despite disjoint inputs, the latent representations of germplasm originating from Population 2 and genotyped on the Population 1 marker platform cluster with Population 1's genotyped versions of those inbred lines.

Referring now to Graphs B and C of FIG. 9, Euclidean distances of latent representations (Graph B) and Pearson correlations of the latent representations (Graph C) are shown. As shown in Graph B, the Euclidean distance of latent representations produced by a global encoder with different marker platform inputs of the same breeding line is near zero, which is indicated as “Self” in Graph B. This indicates that the different marker platform inputs of the same breeding line are close to one another. On the other hand, when different marker platform inputs of different breeding lines are used as inputs to the global encoder, the Euclidean distance is significantly greater than zero, which is indicated as “Non-Self” in Graph B.

Similarly, as shown in Graph C of FIG. 9, the Pearson correlation of the latent representations produced by a global encoder with different marker platform inputs of the same breeding line is near one, which is indicated as “Self” in Graph C. On the other hand, when different marker platform inputs of different breeding lines are used as inputs to the global encoder, the Pearson correlation is around zero, which is indicated as “Non-Self” in Graph C. In other words, for distinct genotypes, these measures are significantly different. Graphs B and C of FIG. 9 again illustrate that the encoder is robust to the marker platforms and is relatively invariant to which marker platform is being used as long as the markers are from the same breeding line.
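The two comparison measures used in Graphs B and C can be computed as in the following sketch. The latent vectors are synthetic stand-ins generated at random, not encoder outputs or results from the disclosure: two nearly identical vectors play the role of one breeding line encoded from two marker platforms ("Self"), and an unrelated vector plays the role of a different line ("Non-Self"). All names are hypothetical.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two latent vectors."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def pearson_correlation(a, b):
    """Pearson correlation between two latent vectors."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
line_platform_1 = rng.standard_normal(16)                           # line X, platform 1
line_platform_2 = line_platform_1 + 0.01 * rng.standard_normal(16)  # line X, platform 2
other_line = rng.standard_normal(16)                                # a different line

self_distance = euclidean_distance(line_platform_1, line_platform_2)
non_self_distance = euclidean_distance(line_platform_1, other_line)
self_correlation = pearson_correlation(line_platform_1, line_platform_2)
non_self_correlation = pearson_correlation(line_platform_1, other_line)
```

A platform-invariant encoder would be expected to give a small "Self" distance and a "Self" correlation close to one, as in this synthetic example.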

FIG. 10 illustrates that latent representations may be used to predict coancestry of individuals within and between various populations as shown in Graph A. Additionally, as shown in Graph B, the latent representations may also be used to predict whole-organism phenotypes, as shown here for YIELD within wheat.

FIG. 11 illustrates embodiments of how haplotype information, which can be imputed based on the universal latent space, may be leveraged for pooling of statistical power in molecular function studies based on replication at the level of the haplotype.

FIG. 12 is an example showing how leveraging of the haplotype information through latent representations results in increased statistical power to detect accessible chromatin based on an ATAC-seq assay. Graph A illustrates the accuracy and power of the haplotype-pooling approach. The location of detected ATAC-seq peaks is compared to those from an independent assay of chromatin accessibility. Peaks detected with or without pooling are both highly enriched within proximity to previously detected peaks relative to random expectation. However, haplotype pooling increases the number of detected peaks by more than an order of magnitude without a substantial loss in accuracy. Graphs B and C illustrate examples of detected peaks using haplotype pooling. Grey lines correspond to tissue peaks that were only detected using haplotype pooling. Graph B illustrates the detection of peaks at alternative TSSs of a single gene, while Graph C illustrates the detection of peaks at a known major QTL in maize that is 65 kb from the nearest protein-coding gene.

As used in this specification and the appended claims, terms in the singular and the singular forms “a,” “an,” and “the,” for example, include plural referents unless the content clearly dictates otherwise. Thus, for example, reference to “plant,” “the plant,” or “a plant” also includes a plurality of plants; also, depending on the context, use of the term “plant” can also include genetically similar or identical progeny of that plant; use of the term “a nucleic acid” optionally includes, as a practical matter, many copies of that nucleic acid molecule; similarly, the term “probe” optionally (and typically) encompasses many similar or identical probe molecules.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains,” “containing,” “characterized by” or any other variation thereof, are intended to cover a non-exclusive inclusion, subject to any limitation explicitly indicated. For example, a composition, mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.

As used herein, the term “haplotype” generally refers to the genotype of any portion of the genome of an individual or the genotype of any portion of the genomes of a group of individuals sharing essentially the same genotype in that portion of their genomes.

As used herein, the term “encoder” generally refers to a network which takes in an input and generates a representation (the encoding) that contains information relevant for the next phase of the network to process it into a desired output format. Generally, the encoder is trained in parallel with the other parts of the network, optimized via back-propagation, to produce representations that are specifically useful for the desired output. For example, a suitable encoder may use a convolutional neural network (CNN) structure, and multi-dimensional encodings or representations are produced. Autoencoders make the encoder generate encodings or representations that are useful for reconstructing its own/prior input, and the entire network may be trained as a whole with the goal of minimizing reconstruction loss.

As used herein, the term “global encoder” generally refers to a network which takes in genome-wide genotypic or phenome-wide phenotypic data as input and generates a representation (the encoding) that contains information relevant for the next phase of the network to process it into a desired output format.

As used herein, the term “local encoder” generally refers to a network which takes in a subset of the genome-wide genotypic or phenome-wide phenotypic data used as input for the global encoder and generates a representation (the encoding) that contains information relevant for the next phase of the network to process it into a desired output format.

As used herein, the term “decoder” generally refers to a network which takes in the output of the encoder and reconstructs a desired output format.

As used herein, the term “global decoder” generally refers to a network which takes in the output of the global encoder and reconstructs a desired output format.

As used herein, the term “local decoder” generally refers to a network which takes in the output of the global encoder and the output from one or more local encoders and reconstructs a desired output format.

Embodiments of the disclosure presented herein provide methods and compositions for using latent representations of data to impute or predict information.

In one embodiment, the imputed or predicted genotypic or phenotypic information is used for genomic prediction, including, but not limited to, whole genome prediction (WGP). Non-limiting examples include those described in WO2016/069078 Improved Molecular Breeding Methods, published May 6, 2016; and WO2015/100236 Improved Molecular Breeding Methods, published Jul. 2, 2015, each of which is incorporated herein by reference in its entirety. For example, imputed genotypic or predicted phenotypic information, optionally together with a biological model such as a biological model that includes gene networks, biochemical pathways, physiological crop growth model (CGM) or combinations thereof, may be used to predict phenotype or trait performance for individuals under various types of environmental conditions. Exemplary types of environmental conditions include but are not limited to increased or decreased water supply in soil, temperature, plant density, and disease or pest stress conditions. One or more individuals having a desired predicted phenotype or trait performance may be produced, grown or crossed with itself or another individual to generate offspring with a desired predicted phenotype or trait performance. Accordingly, in one embodiment, the methods are used to select individuals for use in a breeding program. In another embodiment, one or more individuals having an undesired predicted phenotype or trait performance may be culled from a breeding program.

In another embodiment, imputed molecular and whole plant information may be used to predict phenotype or trait performance for individuals.

In one embodiment, a universal method of parametrically representing genotypic or phenotypic association data from a training data set obtained from a population or a sample set to impute genotype and/or phenotype in test data obtained from a test population or a test sample set is provided herein.

Any population of interest may be used with the methods and compositions described herein. While the methods disclosed herein are exemplified and described primarily using plant populations, the methods are equally applicable to animal populations, for example, non-human animals, such as domesticated livestock, laboratory animals, companion animals, etc. The animal may be a poultry species, a porcine species, a bovine species, an ovine species, an equine species, or a companion animal, and the like. Accordingly, in some embodiments, the population is a population of plants or animals, for example, plant or animal populations for use in a breeding program. In some examples, the one or more populations include plant populations of inbred plants, hybrid plants, doubled haploid plants, including but not limited to F1 or F2 doubled haploid plants, offspring or progeny thereof, including those from in silico crosses, or any combination of one or more of the foregoing. Any monocot or dicot plant may be used with the methods and compositions provided herein, including but not limited to a soybean, maize, sorghum, cotton, canola, sunflower, rice, wheat, sugarcane, alfalfa, tobacco, barley, cassava, peanuts, millet, oil palm, potatoes, rye, or sugar beet plant. In some embodiments, the genotypic data and/or phenotypic data is obtained from a population of soybean, maize, sorghum, cotton, canola, sunflower, rice, wheat, sugarcane, alfalfa, tobacco, barley, cassava, peanuts, millet, oil palm, potatoes, rye, or sugar beet plants.

In some examples, the genotype of interest is associated with a desirable trait of interest and/or the absence of an undesirable trait of interest.

Plant or animal populations or one or more members thereof that are imputed or predicted to have a desired genotype of interest or phenotype of interest may be selected for use in a breeding program. For example, the population or one or more members may be used in recurrent selection, bulk selection, mass selection, backcrossing, pedigree breeding, open pollination breeding, and/or genetic marker enhanced selection. In some instances, a plant having the imputed or predicted desirable genotype of interest or phenotype of interest may be crossed with another plant or back-crossed so that the imputed or predicted desirable genotype may be introgressed into the plant by sexual outcrossing or other conventional breeding methods.

In some examples, a plant having the imputed or predicted desirable genotype of interest or phenotype of interest may be used in crosses with another plant from the same or different population to generate a population of progeny. The plants may be selected and crossed according to any breeding protocol relevant to the particular breeding program.

In other examples, a plant having the imputed or predicted undesirable genotype of interest or phenotype of interest may be counter-selected and removed from a breeding program.

In some aspects, the method includes generating a universal continuous global latent space representation by encoding discrete or continuous variables derived from a genome-wide genotypic or phenome-wide phenotypic association training data into latent vectors through a machine learning-based global variational autoencoder framework. In some aspects, the global latent space is independent of the underlying genotypic or phenotypic association. In some aspects, the method includes generating a local latent representation by encoding a subset of the discrete or continuous variables derived from the genotypic or phenotypic association training data set into latent vectors through a machine learning-based local variational autoencoder framework, where the local latent space is generated with inputs from the local variational autoencoder and the global variational autoencoder. In some aspects, the method includes decoding the global latent representation and the local latent representation by a local decoder, thereby imputing or predicting the genotype or phenotype of the test data by the combination of the decoded global latent representation and the local latent representation.

In some aspects, the genotypic association data includes a collection of genotypic markers or single nucleotide polymorphisms (SNPs) from a plurality of a genetically divergent population. The subset of the discrete variables may be a plurality of single nucleotide polymorphisms (SNPs) localized to a segment of the chromosome. In some aspects, the variational autoencoder is based on a neural network algorithm. In some aspects, the phenotype that is imputed or predicted in the test data or test sample is predicted yield gain. In some aspects, the imputed or predicted phenotype in the test data or test sample is root lodging, stalk lodging, brittle snap, ear height, grain moisture, plant height, disease resistance, drought tolerance, or a combination thereof. In some aspects, the imputed or predicted genotype that is in the test data or test sample is a plurality of haplotypes. In some aspects, the local decoder imputes local high-density (HD) SNPs.

In some aspects, the genotypic association data is obtained from populations of plants derived from two or more breeding programs, where the breeding programs do not comprise an identical set of markers or single nucleotide polymorphisms (SNPs) corresponding to the genotypic association data. In some aspects, the local decoder imputes local high-density (HD) SNPs of one population based on the decoding of genotypic association data of another population. In some aspects, the local decoder imputes haplotypes for one population based on the decoding of genotypic association data of another population. In some aspects, the local decoder imputes or predicts a molecular phenotype including but not limited to gene expression, chromatin accessibility, DNA methylation, histone modifications, recombination hotspots, genomic landing locations for transgenes, transcription factor binding status, or a combination thereof. Gene expression may include a change in the activity or level of expression of transcripts, genes, or other transcribed nucleotide sequences including those global (genome-wide) or local or a subset thereof, a population (subset) of genes, or a gene of interest. In some aspects, the local decoder imputes or predicts population coancestry for one or more of the test populations.

Also provided herein in an embodiment is a universal method of parametrically representing genotypic or phenotypic association data from a training data set obtained from a population or a sample set to infer a characteristic of interest, e.g. a desirable characteristic, in test data obtained from a test population or a test sample data. In some aspects, the method includes generating a universal continuous global latent space representation by encoding discrete or continuous variables derived from a genome-wide genotypic association or phenome-wide phenotypic training data into latent vectors through a machine learning-based global variational autoencoder framework, where the global latent space is independent of the underlying genotypic or phenotypic association. In some aspects, the method includes decoding the global latent representation by a global decoder, thereby inferring the characteristic of interest, e.g. a desirable characteristic, of the test data by the decoded global latent representation.

In some aspects, the characteristic of interest, e.g. a desirable characteristic, is without limitation coancestry determination of two or more populations of plants or predicting yield gain or an agronomic phenotype of interest. In some aspects, the variational autoencoder is based on a neural network algorithm.

Also provided herein is a universal method of developing a universal representation of genotypic or phenotypic data that includes receiving by a first neural network one or more training genotypic or phenotypic data, where the first neural network includes a global variational autoencoder. In some aspects, the method includes encoding by the global encoder, the information from one or more training genotypic or phenotypic data into latent vectors through a machine-learning based neural network training framework. In some aspects, the method includes providing the encoded latent vectors (generated from other genotypic or phenotypic data) to a second machine-learning based neural network, where the second neural network includes a decoder. In some aspects, the method includes training the decoder to learn a prediction or imputation of a genotype or phenotype of interest based on an objective function for the encoded latent vectors. In some aspects, the method includes decoding by the decoder the encoded latent vector for the objective function. In some aspects, the method includes providing an output for the objective function of the decoded latent vector.

Also provided herein is a method of selecting an attribute of interest based on genotypic or phenotypic data. In some aspects, the method includes receiving by a first neural network one or more training global genotypic or phenotypic data, where the first neural network includes a global variational autoencoder. In some aspects, the method includes encoding by the global variational autoencoder, genotypic or phenotypic information from one or more training genotypic or phenotypic data into latent vectors. In some aspects, the method includes training the global variational autoencoder using the latent vectors to learn underlying genotypic or phenotypic correlations and/or relatedness. In some aspects, the method includes receiving by a second neural network one or more training local genotypic or phenotypic data, where the local genotypic or phenotypic data is directed to a subset of global genotypic or phenotypic data that corresponds to a certain attribute of interest, where the second neural network includes a local variational autoencoder. In some aspects, the method includes encoding by the local variational autoencoder, the genotypic or phenotypic information from the one or more training local genotypic or phenotypic data into latent vectors. In some aspects, the method includes training the local variational autoencoder using the latent vectors to learn underlying genotypic or phenotypic correlations and/or relatedness for the attribute of interest. In some aspects, the method includes providing the encoded latent vectors from the global variational autoencoder and/or the local encoder to a third neural network, where the third neural network includes a decoder. In some aspects, the method includes training the decoder to predict the attribute of interest for the encoded latent vectors from the global variational autoencoder and/or the local variational autoencoder using a pre-specified or learned objective function. In some aspects, the method includes decoding by the decoder, the encoded latent vectors for the objective function. In some aspects, the method includes providing an output for the objective function of the decoded latent vector.

The decoder may include one or more decoders. In some aspects, the decoder is a local decoder. In some aspects, the decoder is a global decoder and decodes the encoded latent vectors from the global encoder. In some aspects, the global training genotypic data includes markers across the genome. In some aspects, the local genotypic data is from a specific chromosomal genomic region of interest or allele. In some aspects, the method includes training the global encoder and decoder simultaneously.

In some aspects, the local attribute may include without limitation SNPs, alleles, markers, QTLs, gene expression, phenotypic variation, metabolite level, or combinations thereof. In some aspects, the encoder may be an autoencoder. In some aspects, the autoencoder is a variational autoencoder.

In some aspects, the training genotypic data includes without limitation SNPs or indels sequence information. In some aspects, the training genotypic or phenotypic data includes sequence information from in silico crosses. In some aspects, the encoder weights are updated relative to a reconstruction error so that the training genotypic or phenotypic data information is separated within the latent space. In some aspects, the decoder is trained on existing genotypic or phenotypic data.

Also provided herein is a computer system for generating genotypic or phenotypic data determinations. In one embodiment, the system includes a first neural network that includes a variational autoencoder configured to encode genotypic or phenotypic information from one or more training genotypic or phenotypic data into universal latent vectors, where the encoder has been trained to represent genotypic or phenotypic associations through a machine-learning based neural network framework, and a second neural network that includes a decoder configured to decode the encoded latent vectors and generate an output for an objective function.

In an embodiment, a computer system includes one or more computer programs or other software elements or special programmable instructions, or computer-implemented logic that is configured to parametrize genotypic data, phenotypic data, association data or a combination thereof into latent space as described herein. In an embodiment, the computer system is connected, via a network, to one or more data resources.

EXAMPLES

The present invention is illustrated by the following examples. The foregoing and following description of the present invention and the various examples are not intended to be limiting of the invention but rather are illustrative thereof. Hence, it will be understood that the invention is not limited to the specific details of these examples.

Example 1 Marker Imputation across Disparate Germplasm and Marker Platforms

The maize germplasm collections that originated from distinct closed breeding programs were used for this analysis. These distinct germplasm populations were originally genotyped on disparate marker platforms with a small minority (about 2%) of markers in common between them. Whole genome sequencing and exome capture sequencing efforts provided high density single nucleotide polymorphism (SNP) markers for a smaller subset (~1200 breeding program A, ~2500 breeding program B) of the available inbred lines, and these were mapped to a maize reference genome. A subset of approximately 350,000 high density markers was identified to be in common between the two high density marker sets, and these were selected to provide a measure of reconstruction error that would span both legacy sets of germplasm. Approximately 7,000 SNPs were also identified in the high density data that were used as production markers in one or the other breeding programs. These markers were selected to augment the production marker input and output during training of an autoencoder neural network. A subset of markers was set aside to serve as a basis for scoring the accuracy of cross-breeding program imputation when markers are completely disjoint during training.

As discussed above, the autoencoder neural network may be trained to translate production markers from different populations of germplasm into a universal, platform independent (e.g., marker-independent), latent space. To do so, the training process involves three steps, as described above with regard to FIGS. 5A and 5B. Steps 1 and 2 establish a common latent space between the two sets of germplasm at the global and local scales, while Step 3 provides a decoder to translate from the common latent space to the union of legacy production markers. In this example, in order to augment the training set beyond the 3700 inbred lines available with high density data, synthetic F1 doubled haploids were simulated based on in silico crosses between pre-specified pairs of inbred lines from the high-density genotyped training set.
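By way of non-limiting illustration, the in silico crossing used for data augmentation may be sketched as follows. The Haldane map function, the function names, and the single-chromosome marker layout are assumptions made for this sketch; the example above does not specify how the synthetic doubled haploids were simulated.

```python
import math
import random


def recombination_probability(distance_cm):
    """Haldane map function (an assumption): cM distance -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def simulate_doubled_haploid(parent_a, parent_b, positions_cm, rng):
    """Simulate one doubled haploid line derived from the F1 of two inbred parents.

    parent_a, parent_b: homozygous allele codes, one per marker.
    positions_cm: marker positions on one chromosome, sorted ascending.
    The F1 gamete is a mosaic of the two parents; chromosome doubling makes
    it fully homozygous, so one allele code per marker is sufficient.
    """
    parents = (parent_a, parent_b)
    current = rng.randrange(2)  # parent of origin at the first marker
    progeny = [parents[current][0]]
    for i in range(1, len(positions_cm)):
        gap = positions_cm[i] - positions_cm[i - 1]
        if rng.random() < recombination_probability(gap):
            current = 1 - current  # a crossover switches the parent of origin
        progeny.append(parents[current][i])
    return progeny


line_a = [0, 0, 0, 0, 0, 0]
line_b = [1, 1, 1, 1, 1, 1]
print(simulate_doubled_haploid(line_a, line_b, [0, 10, 20, 30, 40, 50], random.Random(0)))
```

Each call yields one synthetic line, so repeated calls over the pre-specified parent pairs would enlarge the training set.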

In Step 1, the global encoder was trained with an input that includes the union of legacy breeding program markers. In the illustrative embodiment, markers are coded as homozygous for allele A, homozygous for allele B, or missing. Marker invariant latent representation was enhanced through a randomized input scheme. For each input within each minibatch, the set of production markers was randomly chosen to be those from breeding program A, those from breeding program B, or those from the union based on production marker augmentation from the high density SNPs. The dimension of the global latent space was set to 32, so that 32 real numbers were sampled based on the global encoder output and sent to the global decoder. The global decoder then translated the latent input into a reconstruction of the subset of high density SNPs (10,000) chosen for global training, and the loss was calculated based on the reconstruction error and the KL-divergence between the latent representation and the prior of univariate Gaussians.
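By way of non-limiting illustration, the randomized input scheme may be sketched as follows. The integer codes, the equal selection probabilities, and the function name are assumptions made for this sketch and are not specified above.

```python
import random

# Hypothetical integer codes for the three marker states described above.
HOMOZYGOUS_A = 0
HOMOZYGOUS_B = 1
MISSING = -1


def randomize_input(genotype, program_a_idx, program_b_idx, rng):
    """Mask every marker outside a randomly chosen production marker set.

    genotype: list of marker codes for one inbred line.
    program_a_idx / program_b_idx: indices of each program's production markers.
    Returns the chosen set name and the masked copy of the genotype.
    """
    choice = rng.choice(["A", "B", "union"])  # equal weights are an assumption
    if choice == "A":
        keep = set(program_a_idx)
    elif choice == "B":
        keep = set(program_b_idx)
    else:
        keep = set(program_a_idx) | set(program_b_idx)
    masked = [g if i in keep else MISSING for i, g in enumerate(genotype)]
    return choice, masked


print(randomize_input([0, 1, 1, 0, 1], [0, 1], [3, 4], random.Random(3)))
```

Because the same line is presented under different marker subsets, the encoder is encouraged to map all of them to nearby latent vectors.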

In Step 2, local encoders and high-density local decoders were trained within 10 cM bins across the breeding program A maize genetic map. The input to local encoders was restricted to the union of both the breeding programs A and B production SNPs within the chromosome containing the 10 cM bin of interest. Randomization of the input SNP set proceeded as described in Step 1. The size of each local latent space was set to 16, with the Gaussian parameterization otherwise identical to that of the global encoder. Each local decoder received as input the sampled latent output of the local encoder, along with the sampled latent output of the global encoder. In this example, the global encoder weights were not updated during the local training process. In Step 3, the local decoder translated the combined global and local latent representations into a reconstruction of the full set of high density SNPs located within each 10 cM region of interest. The reconstruction error combined with the KL-divergence from the local latent Gaussian priors was used to calculate the loss.
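By way of non-limiting illustration, the partitioning of markers into 10 cM bins and the assembly of a local decoder input may be sketched as follows. The function names are hypothetical, and simple concatenation of the two latent vectors is an assumption about how the inputs are combined.

```python
def bin_index(position_cm, bin_width_cm=10.0):
    """Index of the genetic-map bin containing a marker position."""
    return int(position_cm // bin_width_cm)


def group_markers_by_bin(positions_cm, bin_width_cm=10.0):
    """Map each bin index to the list of marker indices that fall inside it."""
    bins = {}
    for marker, position in enumerate(positions_cm):
        bins.setdefault(bin_index(position, bin_width_cm), []).append(marker)
    return bins


def local_decoder_input(global_latent, local_latent):
    """Concatenate the sampled global and local latent vectors."""
    return list(global_latent) + list(local_latent)


print(group_markers_by_bin([0.5, 9.9, 10.0, 25.3]))
print(len(local_decoder_input([0.0] * 32, [0.0] * 16)))
```

With the latent sizes given above, each local decoder would receive 32 + 16 = 48 real-valued inputs.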

The weights of the global and all local encoders were frozen, and new local production marker decoders were trained for each 10 cM bin, with the input of each local production marker decoder corresponding to those of the high-density marker decoder described in Step 2. The loss for this step was only dependent on the reconstruction error of the combined set of production markers, and loss was only accumulated for production markers that were non-missing for a given inbred line. The randomization scheme for the input markers followed that described in Steps 1 and 2.
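By way of non-limiting illustration, a reconstruction loss that is accumulated only over non-missing markers may be sketched as follows. Binary cross-entropy, averaging over the observed markers, and the integer codes are assumptions made for this sketch; the example above specifies only that missing markers do not contribute to the loss.

```python
import math

MISSING = -1  # hypothetical code; 0 = homozygous A, 1 = homozygous B


def masked_reconstruction_loss(predicted_prob_b, observed):
    """Mean binary cross-entropy over the non-missing markers of one line.

    predicted_prob_b[i]: decoder probability that marker i is homozygous B.
    observed[i]: 0, 1, or MISSING.
    """
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    total, count = 0.0, 0
    for p, y in zip(predicted_prob_b, observed):
        if y == MISSING:
            continue  # missing markers contribute nothing to the loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


print(masked_reconstruction_loss([0.9, 0.2, 0.5], [1, 0, MISSING]))
```

The prediction at a missing marker can take any value without changing the loss, so no gradient would flow from markers that were never observed for that line.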

Following training, imputation accuracy and characterization of the latent space were assessed on a pre-specified held-out, randomly-selected testing set spanning the legacy organizations. Euclidean distances of latent vectors for the same inbred line encoded by the disjoint marker sets of the legacy organizations were clustered near zero, while distances for non-identical lines formed a Gaussian distribution with a mode around 8. Pearson correlations for the latent vectors of the same inbred line with disjoint marker sets clustered near 1, compared to a distribution around 0 for non-identical lines. Testing accuracy of imputed high density SNPs ranged from 97.4% across 100% of high-density SNPs when no confidence cutoff was imposed to 99.1% accuracy across 93.3% of SNPs when a moderate threshold of 0.9 was used to 99.7% accuracy across 86.1% of SNPs when a high threshold of 0.99 was used.
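By way of non-limiting illustration, the trade-off between accuracy and the fraction of SNPs called at a given confidence threshold, and the Euclidean distance between two latent vectors, may be computed as sketched below. The function names and the toy values are hypothetical and are not taken from the testing set described above.

```python
import math


def accuracy_at_threshold(confidences, predicted, truth, threshold):
    """Return (accuracy among called markers, fraction of markers called)."""
    called = [(p, t) for c, p, t in zip(confidences, predicted, truth) if c >= threshold]
    coverage = len(called) / len(truth)
    if not called:
        return None, coverage
    accuracy = sum(p == t for p, t in called) / len(called)
    return accuracy, coverage


def latent_distance(vector_a, vector_b):
    """Euclidean distance between two latent vectors of equal length."""
    return math.dist(vector_a, vector_b)


confidences = [0.99, 0.95, 0.60, 0.92]
predicted = [1, 0, 1, 1]
truth = [1, 0, 0, 0]
for cutoff in (0.0, 0.9, 0.99):
    print(cutoff, accuracy_at_threshold(confidences, predicted, truth, cutoff))
```

Raising the cutoff discards low-confidence calls, which tends to raise accuracy while lowering the fraction of markers imputed, consistent with the pattern reported above.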

Imputation accuracy for production SNPs varied with the breeding programs and the disjointedness of the training regime for the associated markers. Across all testing germplasm and markers, at the chosen moderate threshold of 0.9, imputation accuracy was 99.2% and covered 91.5% of the union of breeding program B and breeding program A production markers. Within breeding program A, testing accuracy for breeding program B production markers that were augmented during training was 98.5%, with 88.1% imputed. The breeding program A testing accuracy for breeding program B markers that were left completely disjoint during training was 96.6%, with 85.4% of these disjoint markers imputed. For breeding program B, testing accuracy for breeding program A production markers augmented during training was 99.3%, with 93% of markers imputed. For breeding program B non-augmented markers, the accuracy was 97.5% with 90% of markers imputed.

Thus, this example demonstrates that employing a machine-learning based variational autoencoder framework for global and local encoding followed by decoding successfully imputes marker data across disparate breeding programs that do not necessarily share substantially the same genotypic association data set (e.g., marker or sequence information). This example also demonstrates that such imputation efficiency can accelerate breeding, including for example selection of breeding pairs and prediction of hybrid performance such as yield, lodging and other desirable characteristics.

Example 2 Haplotype Imputation from Latent Space

Haplotypes, generally referred to herein as linked sets of co-segregating markers in a population, provide a useful means for visualizing genetic variation and imputing functional information to regions of identical sequence across a given population. Using the 350,000 high-density markers in common between breeding program B and breeding program A germplasm, as described in Example 1, a common haplotype framework was established between the breeding program datasets by assigning groups of near identical sequence within each specified region to common haplotypes. Such regions have been defined on both genetic (e.g. 1 cM) and physical (e.g. 1 Mb) maps, including haplotypes at the individual gene level. At the 1 cM genetic scale, regions with high density SNP identity of at least 97% were considered to have common haplotypes. However, generalization of the haplotype framework to inbred lines without high density markers required the use of the genotypic information captured within the global and local latent representations.
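By way of non-limiting illustration, assigning lines to common haplotypes within one region using a 97% identity criterion may be sketched as follows. The greedy comparison against the first member of each group is an assumption made for this sketch, as is the omission of missing-data handling; the grouping procedure actually used is not specified above.

```python
def identity(seq_a, seq_b):
    """Fraction of SNP positions at which two equal-length sequences agree."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def assign_haplotypes(region_genotypes, min_identity=0.97):
    """Greedily label each line's regional SNP sequence with a haplotype id.

    A sequence joins the first existing haplotype whose representative it
    matches at min_identity or better; otherwise it founds a new haplotype.
    """
    representatives = []
    labels = []
    for seq in region_genotypes:
        for hap_id, rep in enumerate(representatives):
            if identity(seq, rep) >= min_identity:
                labels.append(hap_id)
                break
        else:
            representatives.append(seq)
            labels.append(len(representatives) - 1)
    return labels


base = [0] * 100
near = [1, 1] + [0] * 98   # 98% identical to base
far = [1] * 10 + [0] * 90  # 90% identical to base
print(assign_haplotypes([base, near, far]))
```

A greedy scheme of this kind depends on input order; a production framework might instead use a clustering method that does not.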

Following the training of the cross-breeding programs global and local encoders described in Example 1, local haplotype decoders were trained within each haplotype bin. As input, each haplotype decoder received the global latent representation and the local latent representation for the region containing the haplotype bin. The output layer of each decoder was set to the same size as the total number of haplotypes in the bin, and the output activation function was specified such that the scores for all haplotypes in a region would sum to 1. That is, the score for any haplotype could be interpreted as a probability. Training proceeded using the same input randomization and in silico crossing scheme described in Example 1. The definitions of training and testing sets were also maintained from the training of global and local encoders in Example 1.

For example, FIG. 16 illustrates example input and output for a haplotype decoder. Once the global and local encoders are trained as described in Example 1, their parameters are held constant. The local decoder is then trained to predict the probability of each haplotype within a genomic bin that is a subset of the local encoder's range (i.e., Chromosome C in this example). Each column of the local decoder output is associated with a particular haplotype, and a value in each column indicates a probability that the respective haplotype is present within the specified bin on Chromosome C. For example, 0.99 in a third column indicates that the probability that the bin from 1-2 cM on Chromosome C has Haplotype 3 is 0.99.
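By way of non-limiting illustration, an output activation whose scores sum to 1, together with a thresholded haplotype call, may be sketched as follows. The softmax function is an assumption, being one common activation with this property; the activation actually used is not named above. The function names and the default threshold are likewise illustrative.

```python
import math


def softmax(scores):
    """Convert raw decoder scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def call_haplotype(probabilities, threshold=0.9):
    """Return the index of the most probable haplotype, or None if below threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best if probabilities[best] >= threshold else None


row = [0.005, 0.005, 0.99]  # third column corresponds to Haplotype 3
print(call_haplotype(row))  # index 2, i.e. Haplotype 3
print(call_haplotype(softmax([1.0, 2.0, 3.0])))
```

Returning None for low-confidence bins leaves them uncalled rather than forcing an assignment.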

Characterization of haplotype imputation accuracy was performed for both breeding program A and breeding program B testing germplasm following the completion of decoder training. At the chosen haplotype calling threshold of 0.9, 77.3% of all haplotype bins within breeding program A could be imputed with 96% accuracy, while 86.9% of breeding program B haplotypes were able to be called with 98.3% accuracy. For both breeding programs (A and B), a particular breeding line, which had haplotypes well represented within the training data, performed much better than average both in terms of total imputation frequency and accuracy. Loss of accuracy was primarily due to older inbreds, inbreds from different sources outside the breeding programs, and inbreds with a low number of markers.

Thus, this Example demonstrates that haplotypes for a test breeding population can be imputed based on latent representations of the underlying genotypic data (e.g., high-density markers) through global encoding, local encoding and decoding using a variational autoencoding framework.

Example 3 Imputation of Haplotypes in Multiple Crops

Haplotype frameworks were initiated with breeding program A germplasm for crops outside of corn, including the monocot grass rice and the dicot legume soybean. Haplotype sets were constructed using methods described in Example 2, following whole genome sequencing and characterization of high-density SNP variation within representative lines originating from the breeding programs of each crop. After construction of the haplotype frameworks, imputation of the haplotypes was initiated on non-sequenced members of each species using the inference from global and local latent spaces.

Approximately 700 production markers within rice and 2000 production markers within soy were collected to serve as inputs for all global and local encoders. Prior to training, test sets were defined such that they would only be used for characterization of imputation accuracy. Sets of plausible crosses between breeding lines were also collected to allow for data augmentation during training with in silico crosses between observed lines.

The global encoders were first trained with variational autoencoding objectives, using the same production markers for both the input to the global encoders and the output from the global decoders. The global decoders received sampled latent vectors from the global encoders during training. The dimensionality of the global latent space was set to 32 for each species, and the objective function for the global autoencoder framework included reconstruction error terms for the production markers and unit-Gaussian KL-divergence penalties for the latent space. Both observed and in silico crosses were sampled during training, in addition to random dropout of markers to simulate a wide sample of missingness scenarios.
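By way of non-limiting illustration, sampling a latent vector from the encoder output and computing the unit-Gaussian KL-divergence penalty may be sketched as follows. The sketch assumes the encoder emits a mean and a log-variance for each latent dimension, which is a common parameterization but is not stated above; the function names are hypothetical.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1) for each latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


def kl_to_unit_gaussian(mean, log_var):
    """KL divergence from a diagonal Gaussian N(mean, exp(log_var)) to N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var))


latent_dim = 32
mean = [0.1] * latent_dim
log_var = [-0.2] * latent_dim
z = sample_latent(mean, log_var, random.Random(0))
print(len(z), kl_to_unit_gaussian(mean, log_var))
```

Under these assumptions, the per-example training objective would be the reconstruction error plus this KL term; the penalty is zero only when the encoder output equals the unit-Gaussian prior.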

Following the completion of global encoder and decoder training, training of the local encoders and local haplotype decoders was initiated. Local encoders and decoders were trained simultaneously, with each local encoder spanning a subsection of a single chromosome and each local decoder spanning a single haplotype bin within the physical span of the given local encoder. Sampling of in silico crosses and random dropout of markers proceeded as in the training of the global encoder. The input to each local encoder consisted of the production markers from only its assigned chromosome, while the input to the local decoder included a sampled global latent vector from the global encoder and a sampled local latent vector from the local encoder. As mentioned in Examples 1 and 2, the weights of the global encoder were not updated during training of the local encoders and decoders. The output of each local decoder was set to the size of the number of haplotypes within the given bin, with all haplotype scores for an example summing to 1, as in Example 2. The objective function for the local encoders and decoders consisted of the reconstruction error for the imputed haplotypes and the KL-divergence between the unit Gaussian priors and the distribution of the local latent space.

Following the completion of all global and local neural networks, haplotype imputation accuracy was assessed on the testing sets of each crop species. Within rice, a moderate threshold of 0.75 permitted haplotype imputation over an average of 81% of each genome with an accuracy of 97.5% using its ~700 markers. The same threshold in soy with 2000 markers led to a testing accuracy of 96.8% over an average of 79.8% of the genome.

Thus, this Example demonstrates that the imputation framework developed for corn is also effective for other crops such as rice and the dicot soy. The accuracy of the haplotype imputation for rice and soy was high, as demonstrated above.

Example 4 Imputing Molecular Phenotypes

Many molecular features of interest, such as gene expression, chromatin accessibility, DNA methylation, histone modifications, and transcription factor binding status, hereafter referred to collectively as molecular phenotypes in this Example, are locally, or cis, regulated by short DNA sequences. Therefore, observed molecular phenotypes corresponding to a given haplotype within a specified stage and/or tissue may be inferred to exist within other samples from the population containing the same haplotype. Moreover, different tissues and stages have varying degrees of similarity at the molecular level, allowing some sharing of information at the levels of both haplotype and tissue. Within breeding program A, the latent space transformations and the haplotype framework were combined to optimally impute chromatin accessibility to the haplotype level in corn.

An assay for transposase-accessible chromatin using sequencing (ATAC-seq) was run on 11 tissues in 11 diverse inbred corn lines, with 2 of the inbred lines having data on every tissue. Although the inbred lines were chosen to represent the diversity of breeding program A maize germplasm, there were many locations of haplotype sharing between individual lines. Moreover, one line did not have high-density markers available and instead had its haplotypes imputed using the methods described in Examples 1 and 2. The sampled tissues included both root and shoot derived organs at stages ranging from early seedling (V1) to post-flowering (R1).

Following alignment of read data and calling of read depth peaks within individual samples, a variational autoencoder framework was trained in order to form a latent representation of peak sharing among haplotypes and tissues. One percent of the genome, as partitioned in physical space among the maize reference genome, was chosen to serve as the training set for the latent space. The encoder received the peak signal for a given region in all tissue replicates of all samples except for a query haplotype i in a query tissue j. All sample replicates from tissue j with the haplotype i at the given region of the genome were set to missing. The encoder transformed the peak signal inputs into a real-valued latent vector, as in Examples 1-3, which represented the co-occurrence of peaks among haplotypes and tissues. A sampled latent representation was then passed to the decoder, which then transformed the latent representation to a reconstruction of peak signals in all haplotypes and tissues. Optimization of the objective function then minimized the reconstruction error, with regularization based on the KL-divergence of the latent space distributions and unit Gaussians.

Example inputs and outputs for training an encoder for predicting molecular phenotypes are shown in FIG. 18. To do so, the haplotype for each inbred line within a genomic region is identified, and this information is combined with the known tissue type of each individual sample. For each sample, Channel 1 indicates a value obtained from the −log(p) peak signal of an individual sample run with a peak-calling algorithm, and Channel 2 indicates whether a peak is designated as missing. For the purpose of training the neural network, one or more signals in a tissue and haplotype of interest are set to be missing. Specifically, in the illustrative embodiments, peaks for Haplotype 3 in leaf (i.e., Samples 1 and 3) are set to be missing, as indicated by value 0 in Channel 1 and value 1 in Channel 2. Subsequently, measurements of individual sample peak intensities are passed to an encoder with the missing peaks of Samples 1 and 3. The decoder is simultaneously trained to reconstruct the full set of signals. The output data may be used for further training.
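By way of non-limiting illustration, the two-channel encoding with a masked query haplotype and tissue may be sketched as follows. The dictionary layout, the function name, and the signal values are hypothetical and are not taken from FIG. 18.

```python
def encode_peak_channels(samples, query_haplotype, query_tissue):
    """Build the two input channels for one genomic region.

    samples: list of dicts with keys 'haplotype', 'tissue', and 'signal',
             where 'signal' is the -log(p) peak value for that sample.
    Channel 1 holds the signal (0.0 where masked); Channel 2 flags missingness.
    """
    channel_1, channel_2 = [], []
    for sample in samples:
        masked = (sample["haplotype"] == query_haplotype
                  and sample["tissue"] == query_tissue)
        channel_1.append(0.0 if masked else sample["signal"])
        channel_2.append(1 if masked else 0)
    return channel_1, channel_2


samples = [
    {"haplotype": 3, "tissue": "leaf", "signal": 5.2},  # Sample 1
    {"haplotype": 1, "tissue": "leaf", "signal": 3.1},  # Sample 2
    {"haplotype": 3, "tissue": "leaf", "signal": 4.8},  # Sample 3
    {"haplotype": 3, "tissue": "root", "signal": 2.0},  # Sample 4
]
print(encode_peak_channels(samples, query_haplotype=3, query_tissue="leaf"))
```

Keeping a separate missingness channel lets the network distinguish a masked peak from a genuinely observed signal of zero.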

Additionally, example inputs and outputs for training a transformer for predicting molecular phenotypes are shown in FIG. 19. The parameters for the encoder are held constant, while the transformer is trained to predict the prior probability of true signal within a given haplotype and tissue combination, which is set to be missing within the input (i.e. Samples 1 and 3). In other words, even though signals of Haplotype 3 in leaf (i.e., Samples 1 and 3) were set to be missing, the prior probability of Haplotype 3 having true signal in leaf in the genomic region is 0.9. This prior information can then be combined with the data from the missing input via a likelihood function in order to quantify the full evidence of true signal within the genomic region.

After fitting of the latent space, training of a transformer network began within the context of a probabilistic model of ATAC-seq signal. The transformer network received the latent representation as input and transformed it into the prior probability of a signal in a tissue and haplotype of interest. The input to the encoder remained the signals for all haplotypes and tissues except that of interest, allowing the prior probabilistic model to be informed by only information outside of the desired inference space. This prior model was then incorporated into a mixture model of two distributions, one denoting values emanating from true underlying chromatin accessibility signal and one denoting values from regions with zero true signal. Both were parameterized by gamma distributions, with terms for the power of specific replicates and, in the case of the true signal distribution, a term for the strength of the true signal. Inference was conducted using a Bayes factor that compared the marginal likelihoods of the observed signal strengths under the true signal and no signal distributions, with integration occurring over the true signal distribution. These Bayes factors factored in the prior probabilities for each distribution, thereby allowing haplotypes and tissues to share information.
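By way of non-limiting illustration, a prior-weighted Bayes factor comparing a true-signal gamma distribution against a no-signal gamma distribution may be sketched as follows. The sketch is heavily simplified: the shape and scale values, the uniform grid used to integrate over signal strength, and the omission of replicate-specific power terms are all assumptions made here and do not reflect the parameterization of the model described above.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution with the given shape and scale."""
    if x <= 0:
        return 0.0
    log_density = ((shape - 1.0) * math.log(x) - x / scale
                   - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_density)


def likelihood(values, shape, scale):
    """Joint likelihood of independent replicate values."""
    result = 1.0
    for value in values:
        result *= gamma_pdf(value, shape, scale)
    return result


def prior_weighted_bayes_factor(values, prior_signal, strengths,
                                null_shape=1.0, null_scale=1.0, signal_shape=2.0):
    """Prior odds times the ratio of marginal likelihoods (signal vs. no signal).

    The signal likelihood is averaged over a grid of candidate strengths,
    a crude stand-in for integrating over the true signal distribution.
    """
    null_like = likelihood(values, null_shape, null_scale)
    signal_like = sum(likelihood(values, signal_shape, s) for s in strengths) / len(strengths)
    prior_odds = prior_signal / (1.0 - prior_signal)
    return prior_odds * signal_like / null_like


grid = [2.0, 4.0, 6.0]
print(prior_weighted_bayes_factor([8.0, 9.0], 0.9, grid))  # strong replicate signals
print(prior_weighted_bayes_factor([0.2, 0.3], 0.9, grid))  # weak replicate signals
```

In this sketch the prior probability supplied by the transformer network enters only through the prior odds, so the same observed replicates yield more evidence for signal when related haplotypes and tissues also show peaks.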

The resulting model was evaluated using a combination of simulation and assessment of real data. Under simulation, with an empirically derived ratio of true versus no-signal regions and reasonable levels of sample noise, all true no-signal regions were found to have Bayes factors less than or equal to 1. Sensitivity was also reasonably high, with an area under the precision-recall curve greater than 0.8 for all tissues. Estimates of individual replicate statistical power and the covariance of signals among tissues were highly positively correlated with the true values. When applied to real data, approximately 5 million additional bases of peaks were able to be identified in the haplotypes corresponding to the maize reference genome, beyond the peaks that could be identified without application of the haplotype framework. Sixty percent of this peak space was within 100 base pairs of a previously identified accessible region from a completely independent assay using micrococcal nuclease (MNase) sensitivity, which was 600% higher than the expectation under a random distribution relative to previously identified peaks.

This Example demonstrates that by employing the variational autoencoder-based training models, chromatin accessibility (a molecular phenotype) was predicted with greater accuracy than other methods.

Example 5 Predicting Agronomic Phenotypes

The latent representation of the genetic space also permits inference of genetic contributions to agronomic phenotypes, thereby enabling unified genomic prediction of crops even without shared marker sets. One phenotype of interest is the brittlesnap stalk lodging score provided by the screening of corn hybrids with a wind machine. Training and testing sets on measured brittlesnap scores were obtained, with the testing set stratified to contain hybrids such that 0, 1, or both of the parents were present within at least 1 training hybrid.

For example, as illustrated in FIG. 20, the global and local encoders were trained as outlined in Examples 1 and 2, and a decoder was trained to receive the global and local encoder representations of a given hybrid's parents as input. It should be appreciated that, in the illustrative embodiment, each local encoder is associated with each phenotype. Although only one phenotype decoder is shown in FIG. 20, it should be appreciated that there are different phenotype decoders for each phenotype. The decoder's output (2.4±0.1) consisted of a continuous prediction of the brittlesnap score. It should be appreciated that the weights of the global and local encoders were fixed during training, while those of the decoder were updated in order to minimize the prediction error for the phenotypic scores.
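The training arrangement described above, in which encoder weights are fixed and only decoder weights are updated, can be sketched as follows. This is a minimal sketch under stated assumptions, not the networks of FIG. 20: the encoder is stood in for by a fixed linear projection, the decoder is a linear model fitted by stochastic gradient descent on squared error, and a single encoder replaces the separate global and local encoders. All names are hypothetical.

```python
def frozen_encoder(genotype, enc_weights):
    """Project a marker vector into a latent vector using fixed weights."""
    return [sum(w * g for w, g in zip(row, genotype)) for row in enc_weights]


def predict(parent1, parent2, enc_weights, dec_weights, bias):
    """Predict a hybrid's score from the concatenated parental latents."""
    z = (frozen_encoder(parent1, enc_weights)
         + frozen_encoder(parent2, enc_weights))
    return bias + sum(d * v for d, v in zip(dec_weights, z))


def train_decoder(parent_pairs, scores, enc_weights, epochs=200, lr=0.01):
    """Fit the linear decoder; the encoder weights are never modified."""
    latent_dim = len(enc_weights)
    dec_weights = [0.0] * (2 * latent_dim)
    bias = 0.0
    for _ in range(epochs):
        for (p1, p2), y in zip(parent_pairs, scores):
            z = (frozen_encoder(p1, enc_weights)
                 + frozen_encoder(p2, enc_weights))
            err = bias + sum(d * v for d, v in zip(dec_weights, z)) - y
            # Gradient step on squared error; only decoder parameters move.
            dec_weights = [d - lr * err * v for d, v in zip(dec_weights, z)]
            bias -= lr * err
    return dec_weights, bias
```

Because the encoder is only ever read inside `train_decoder`, the latent representation of each parent stays the same before and after the decoder is fitted, which is the property the fixed-weight arrangement relies on.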

Following completion of training, testing accuracy was evaluated on the held-out hybrids. Accuracy was measured via the Pearson's correlation coefficient between the predicted and observed brittlesnap score. The accuracy for hybrids with 1 inbred completely absent from the training set was 0.625, while that for hybrids with both inbred parents somewhere in the training set (but not including the testing combination) was 0.737. These values indicate that the predictions were strongly correlated with the observed phenotype. This Example demonstrates that a commercially relevant agronomic characteristic was predicted based upon the variational autoencoder framework described herein.
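For reference, the accuracy metric named above can be computed as in the following sketch. The function name and the input lists are hypothetical; only the definition of the Pearson's correlation coefficient itself is standard.

```python
import math


def pearson_r(predicted, observed):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0.0 or var_o == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)
```

Applied separately to each testing stratum (for example, hybrids with one parent absent from training versus hybrids with both parents present), a function of this kind would yield one accuracy value per stratum.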

Example 6 Population Coancestry from Latent Space

The coancestry between any two samples is a fundamental metric for performing quantitative genetics analyses. Because the latent space transformation of the genetic space allows for a marker-invariant (or marker independent) representation of the underlying genetics, it can also be used for the calculation of population-genetics features such as the coancestry between samples, as shown in FIG. 7.

Following the training of the global encoder, a decoder was trained to calculate the coancestry between any two inbred lines in corn given the global latent representation of each line. Training proceeded with a combination of observed genotypes and in silico crosses between them, as outlined in Examples 1-3. All observed genotypes used during training were the same as the genotypes used for the training of the global encoder, with a separate test set held out for final assessment of accuracy. Random dropout of input markers to the global encoders was also performed, as outlined in Examples 1-3. The weights of the global encoder were not updated during training. The objective function was set to minimize the error between the predicted coancestry and the observed coancestry, as calculated by the fraction of haplotype bins between any two lines that were identical in state. Finally, sampling of the training pairs was stratified according to true coancestry, such that pairs with coancestry within bins of 0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.5, and 0.5-1 were sampled at even rates. This stratification scheme was motivated by the predominance of pairs with near-zero coancestry, which led to higher variance of high coancestry predictions in the absence of stratification.
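The observed coancestry and the stratified sampling of training pairs can be sketched as follows. The bin edges are those stated above; the haplotype encoding (one label per haplotype bin), the treatment of values on a bin boundary, and sampling with replacement are assumptions made for the sketch. All names are hypothetical.

```python
import random

# Bin edges from the text: 0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.5, 0.5-1.
BIN_EDGES = [0.0, 0.1, 0.2, 0.3, 0.5, 1.0]


def ibs_coancestry(hap_a, hap_b):
    """Fraction of haplotype bins at which two lines are identical in state."""
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("need two non-empty haplotype vectors of equal length")
    return sum(a == b for a, b in zip(hap_a, hap_b)) / len(hap_a)


def bin_index(value):
    """Index of the coancestry bin; boundary values go to the upper bin,
    and the last bin includes 1.0 (an assumption of this sketch)."""
    for i in range(len(BIN_EDGES) - 1):
        if value < BIN_EDGES[i + 1]:
            return i
    return len(BIN_EDGES) - 2


def stratified_sample(pairs, coancestries, n_per_bin, rng):
    """Draw n_per_bin pairs, with replacement, from every non-empty bin."""
    bins = {}
    for pair, c in zip(pairs, coancestries):
        bins.setdefault(bin_index(c), []).append(pair)
    sample = []
    for idx in sorted(bins):
        sample.extend(rng.choice(bins[idx]) for _ in range(n_per_bin))
    return sample
```

Drawing the same number of pairs from each bin keeps the many near-zero pairs from dominating a training batch, which is the motivation given above for the stratification.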

For example, training of a coancestry decoder from a latent space is described in FIG. 17. The coancestry decoder receives inputs from the global encodings of two different genotypes (i.e., inbred lines 1 and 2). It outputs an estimate of the coancestry between the genotypes as well as an estimate of the uncertainty in that prediction. In this example, the predicted value of the coancestries between inbred lines 1 and 2 was 0.75±0.03.

Following training, accuracy of the coancestry calculation was assessed on a random set of 3200 pairs of inbred lines within the testing set. The overall Pearson's correlation between the predicted and true coancestries was 0.964, with the mode of the predicted values following the diagonal and indicating good calibration of the predicted coancestries. Thus, this Example demonstrates that the variational autoencoder framework can be used to determine ancestry relationships of two or more individual lines based on the latent representations of those lines. This latent representation can be marker-invariant or marker-independent, providing a powerful way to examine ancestry relationships without the need to do extensive marker analysis using the same marker set.

1-29. (canceled)
 30. A universal method of parametrically representing genotypic or phenotypic association data from a training data set obtained from a plant population or a plant sample set to impute or predict a genotype and/or a phenotype in a non-training data obtained from a non-training plant population or a non-training plant sample data, the method comprising: generating a universal continuous global latent space representation by encoding discrete or continuous variables derived from a genotypic or phenotypic association training data into latent vectors through a machine learning-based global autoencoder framework, wherein the global latent space is independent of the underlying genotypic or phenotypic association; generating a local latent representation by encoding a subset of the discrete or continuous variables derived from the genotypic or phenotypic association training data set into latent vectors through a machine learning-based local autoencoder framework, wherein the local latent space is generated with inputs from the local autoencoder and the global autoencoder; decoding the global latent representation and the local latent representation by a local decoder, thereby imputing or predicting the genotype or phenotype of the non-training data by the combination of the decoded global latent representation and the local latent representation; and selecting one or more plant populations or members thereof based on the imputed or predicted genotype or phenotype.
 31. The method of claim 30, wherein the genotypic association training data is genome-wide genotypic association training data.
 32. The method of claim 30, wherein the phenotypic association training data is phenome-wide phenotypic association training data.
 33. The method of claim 30, wherein the genotypic association data comprises a collection of genotypic markers or single nucleotide polymorphisms (SNPs) from a plurality of genetically divergent populations.
 34. The method of claim 30, wherein the subset of the discrete variables is a plurality of single nucleotide polymorphisms (SNPs) localized to a segment of the chromosome.
 35. The method of claim 30, wherein the genotypic association data is obtained from populations of plants derived from two or more breeding programs, wherein the breeding programs do not comprise an identical set of markers or single nucleotide polymorphisms (SNPs) corresponding to the genotypic association data.
 36. The method of claim 30, wherein the machine learning-based global autoencoder framework or machine learning-based local autoencoder framework or a combination thereof is a variational autoencoder.
 37. The method of claim 36, wherein the variational autoencoder is based on a neural network algorithm.
 38. The method of claim 30, wherein the machine learning-based global autoencoder framework, machine learning-based local autoencoder framework, or a combination thereof is a generative adversarial network.
 39. The method of claim 30, the method comprising: training the decoder to learn a prediction or imputation of a genotype or phenotype of interest based on an objective function for the encoded latent vectors.
 40. The method of claim 39, the method further comprising: decoding by the decoder the encoded latent vector for the objective function.
 41. The method of claim 40, the method comprising: providing an output for the objective function of the decoded latent vector.
 42. The method of claim 30, the method comprising crossing with another population or member the one or more selected populations or members thereof imputed or predicted to comprise a genotype associated with a desirable trait of interest.
 43. The method of claim 30, the method comprising counter-selecting from a breeding program the one or more selected populations or members thereof imputed or predicted to comprise a genotype associated with an undesirable trait of interest.
 44. The method of claim 30, the method comprising crossing with another population or member the one or more selected populations or members thereof imputed or predicted to comprise a phenotype associated with a desirable trait of interest.
 45. The method of claim 30, the method comprising counter-selecting from a breeding program the one or more selected populations or members thereof imputed or predicted to comprise a phenotype associated with an undesirable trait of interest.
 46. The method of claim 30, wherein the plant is a soybean, maize, sorghum, cotton, canola, sunflower, rice, wheat, sugarcane, alfalfa, tobacco, barley, cassava, peanuts, millet, oil palm, potatoes, rye, or sugar beet plant.
 47. The method of claim 30, wherein the imputed or predicted phenotype is yield gain, root lodging, stalk lodging, brittle snap, ear height, grain moisture, plant height, disease resistance, drought tolerance, or a combination thereof.
 48. The method of claim 30, wherein the decoder imputes or predicts a molecular phenotype selected from gene expression, chromatin accessibility, DNA methylation, histone modifications, recombination hotspot, genomic landing locations for transgenes, transcription factor binding status, or a combination thereof.
 49. The method of claim 30, wherein the imputed or predicted genotype is a plurality of haplotypes.
 50. The method of claim 30, the method further comprising the step of: (a) imputing or predicting by the local decoder local high-density (HD) SNPs; (b) imputing or predicting by the local decoder local high-density (HD) SNPs or haplotypes of one population based on the decoding of genotypic association data of another population; (c) imputing or predicting by the local decoder a molecular phenotype selected from gene expression, chromatin accessibility, DNA methylation, histone modifications, recombination hotspot, genomic landing locations for transgenes, transcription factor binding status, or a combination thereof; or (d) imputing or predicting by the local decoder population coancestry for one or more of the non-training populations.
 51. A computing device comprising a processor configured to perform the steps of the method of claim 1.
 52. A computer-readable medium comprising instructions which, when executed by a computing device, cause the computing device to carry out the steps of the method of claim 1.